How to subset multiple elements of a list in R

I am trying to remove undesired values from elements in a list. The example below is a condensed version of my attempt at solving the problem. This will become part of a Shiny app, so it needs to handle input vectors of any size or cardinality (shown below as A through E, the group vectors, and remove), since these will be the indirect result of a user selection.
A <- c(35,35,2609,917,0)
B <- c(8,6,9,24,27,35)
C <- c(1,45,91,24)
D <- c(927,38,22,9)
E <- c(6361,7,43)
my.list <- list(A, B, C, D, E)
group1 <- c(1,2)
group2 <- c(3,5)
remove <- c(35, 24, 6361)
my.list[group1] <- my.list[group1] %>% subset(., !.%in% remove)
my.list
###final expected output
my.list
[[1]]
[1] 2609 917 0
[[2]]
[1] 8 6 9 27
[[3]]
[1] 1 45 91 24
[[4]]
[1] 927 38 22 9
[[5]]
[1] 6361 7 43
The solution should allow for any number of input groups that specify the locations of the list elements to subset, any number of list elements, and any number of values to remove. It shouldn't rely on any fixed cardinality of membership.
Thanks!

my.list[group1] %<>% lapply(function(x) setdiff(x, remove))
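Note that %<>% comes from the magrittr package, and that setdiff also drops duplicated values from each vector. If neither is wanted, a minimal base R sketch using logical indexing (reusing the data from the question) could look like this:

```r
A <- c(35, 35, 2609, 917, 0)
B <- c(8, 6, 9, 24, 27, 35)
C <- c(1, 45, 91, 24)
D <- c(927, 38, 22, 9)
E <- c(6361, 7, 43)
my.list <- list(A, B, C, D, E)
group1 <- c(1, 2)
remove <- c(35, 24, 6361)

# Logical indexing keeps the original order and any repeated values
# that are not listed in `remove`
my.list[group1] <- lapply(my.list[group1], function(x) x[!x %in% remove])
my.list
```

Only the elements named in group1 are touched, so the third to fifth elements keep values such as 24 and 6361.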

Related

Conditional selection of elements of a list in Base R

I'm trying to find the unique elements in the variables listed as x.
The only constraint is that I want to first find the variable (here either a, b, or c) in the list whose max element is smallest, and keep that variable untouched at the top of the output.
I have tried something but can't implement the constraint above:
P.S. My goal is to achieve a function/looping structure to handle larger lists.
x = list(a = 1:5, b = 3:7, c = 6:9) ## a list of 3 variables; variable `a` has the smallest
## max among all variables in the list, so keep `a`
## untouched at the top of the output.
x[-1] <- Map(setdiff, x[-1], x[-length(x)]) ## Now, take the values of `b` not shared
## with `a`, AND values of `c` not shared
## with `b`.
x
# Output: # This output is OK now, but if we change order of `a`, `b`,
# and `c` in the initial list the output will change.
# This is why the constraint above is necessary?
$a
[1] 1 2 3 4 5
$b
[1] 6 7
$c
[1] 8 9
#Find which element in the list has smallest max.
smallest_max <- which.min(sapply(x, max))
#Rearrange the list by keeping the smallest max in first place
#followed by remaining ones
new_x <- c(x[smallest_max], x[-smallest_max])
#Apply the Map function
new_x[-1] <- Map(setdiff, new_x[-1], new_x[-length(new_x)])
new_x
#$a
#[1] 1 2 3 4 5
#$b
#[1] 6 7
#$c
#[1] 8 9
We can wrap this up in a function and then use it:
keep_smallest_max <- function(x) {
  smallest_max <- which.min(sapply(x, max))
  new_x <- c(x[smallest_max], x[-smallest_max])
  new_x[-1] <- Map(setdiff, new_x[-1], new_x[-length(new_x)])
  new_x
}
keep_smallest_max(x)
#$a
#[1] 1 2 3 4 5
#$b
#[1] 6 7
#$c
#[1] 8 9
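The function above moves only the element with the smallest max to the front; the remaining elements stay in their input order, so the result can still depend on how they were arranged. If that matters, one possible variant (an assumption about the intent, not part of the answer above) is to sort the whole list by max before taking the pairwise differences. The name setdiff_by_max is hypothetical:

```r
# Hypothetical variant: order every element by its max, then remove from
# each element the values it shares with the element just before it.
setdiff_by_max <- function(x) {
  x <- x[order(sapply(x, max))]
  if (length(x) > 1) {
    x[-1] <- Map(setdiff, x[-1], x[-length(x)])
  }
  x
}

shuffled <- list(c = 6:9, b = 3:7, a = 1:5)  # same data, different order
setdiff_by_max(shuffled)
#$a
#[1] 1 2 3 4 5
#$b
#[1] 6 7
#$c
#[1] 8 9
```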

random sampling of columns based on column group

I have a simple problem which can be solved in a dirty way, but I'm looking for a clean way using data.table
I have the following data.table with n columns belonging to m unequal groups. Here is an example of my data.table:
dframe <- as.data.frame(matrix(rnorm(60), ncol=30))
cletters <- rep(c("A","B","C"), times=c(10,14,6))
colnames(dframe) <- cletters
A A A A A A
1 -0.7431185 -0.06356047 -0.2247782 -0.15423889 -0.03894069 0.1165187
2 -1.5891905 -0.44468389 -0.1186977 0.02270782 -0.64950716 -0.6844163
A A A A B B B
1 -1.277307 1.8164195 -0.3957006 -0.6489105 0.3498384 -0.463272 0.8458673
2 -1.644389 0.6360258 0.5612634 0.3559574 1.9658743 1.858222 -1.4502839
B B B B B B B
1 0.3167216 -0.2919079 0.5146733 0.6628149 0.5481958 -0.01721261 -0.5986918
2 -0.8104386 1.2335948 -0.6837159 0.4735597 -0.4686109 0.02647807 0.6389771
B B B B C C
1 -1.2980799 0.3834073 -0.04559749 0.8715914 1.1619585 -1.26236232
2 -0.3551722 -0.6587208 0.44822253 -0.1943887 -0.4958392 0.09581703
C C C C
1 -0.1387091 -0.4638417 -2.3897681 0.6853864
2 0.1680119 -0.5990310 0.9779425 1.0819789
What I want to do is to take a random subset of the columns (of a specific size), keeping the same number of columns per group (if the chosen sample size is larger than the number of columns belonging to one group, take all of the columns of this group).
I have tried an updated version of the method mentioned in this question:
sample rows of subgroups from dataframe with dplyr
but I'm not able to map the column names to the by argument.
Can someone help me with this?
Here's another approach, IIUC:
idx <- split(seq_along(dframe), names(dframe))
keep <- unlist(Map(sample, idx, pmin(7, lengths(idx))))
dframe[, keep]
Explanation:
The first step splits the column indices according to the column names:
idx
# $A
# [1] 1 2 3 4 5 6 7 8 9 10
#
# $B
# [1] 11 12 13 14 15 16 17 18 19 20 21 22 23 24
#
# $C
# [1] 25 26 27 28 29 30
In the next step we use
pmin(7, lengths(idx))
#[1] 7 7 6
to determine the sample size in each group and apply this to each list element (group) in idx using Map. We then unlist the result to get a single vector of column indices.
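One caveat with calling sample directly: when its first argument is a single number n, it samples from 1:n, so a group that contains only one column could return the wrong index. The sketch below wraps the same split-and-sample idea in a helper that avoids this by sampling positions with sample.int. The helper name sample_cols and the fixed seed are assumptions added for illustration:

```r
set.seed(42)  # assumption: fixed seed only so the sketch is reproducible
dframe <- as.data.frame(matrix(rnorm(60), ncol = 30))
colnames(dframe) <- rep(c("A", "B", "C"), times = c(10, 14, 6))

# Hypothetical helper: pick at most `n` column positions per name group
sample_cols <- function(df, n) {
  idx <- split(seq_along(df), names(df))
  picked <- lapply(idx, function(i) {
    # Sample positions within the group, then map back to column indices
    i[sample.int(length(i), min(n, length(i)))]
  })
  sort(unlist(picked, use.names = FALSE))
}

keep <- sample_cols(dframe, 7)
table(names(dframe)[keep])
#
# A B C
# 7 7 6
```

The selected columns can then be extracted with dframe[, keep] as before.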
Not sure if you want a solution with dplyr, but here's one with just lapply:
dframe <- as.data.frame(matrix(rnorm(60), ncol=30))
cletters <- rep(c("A","B","C"), times=c(10,14,6))
colnames(dframe) <- cletters
# Number of columns to sample per group
nc <- 8
res <- do.call(cbind,
  lapply(unique(colnames(dframe)), function(x) {
    cols <- which(colnames(dframe) == x)
    # Take every column of a small group; otherwise sample nc of them
    dframe[, if (length(cols) <= nc) cols else sample(cols, nc, replace = FALSE)]
  })
)
It might look complicated, but it really just takes all columns of a group if the group has nc columns or fewer, and samples nc random columns if it has more.
And to restore your original column-name scheme, gsub does the trick:
colnames(res) <- gsub('\\.[[:digit:]]+$','',colnames(res))

R list get first item of each element

This should probably be very easy for someone to answer but I have had no success on finding the answer anywhere.
I am trying to return, from a list in R, the first item of each element of the list.
> a
[1] 1 2 3
> b
[1] 11 22 33
> c
[1] 111 222 333
> d <- list(a = a,b = b,c = c)
> d
$a
[1] 1 2 3
$b
[1] 11 22 33
$c
[1] 111 222 333
Based on the construction of my list d above, I want to return a vector with three values:
return 1 11 111
sapply(d, "[[", 1) should do the trick.
A bit of explanation:
sapply: iterates over the elements in the list
[[: is the extraction operator, which is itself a function. So we are asking sapply to call it on each list element.
1 : is an argument passed to "[["
It turns out that "[" or "[[" can be called in a traditional manner which may help to illustrate the point:
x <- 10:1
"["(x, 3)
# [1] 8
You can do
output <- sapply(d, function(x) x[1])
If you don't need the names
names(output) <- NULL
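If the result type should be guaranteed, vapply is a stricter alternative to sapply: it fails with an error instead of silently returning a list when an element does not yield a single number. A minimal sketch, rebuilding the list from the question:

```r
d <- list(a = c(1, 2, 3), b = c(11, 22, 33), c = c(111, 222, 333))

# numeric(1) declares that every element must yield exactly one number
first_items <- unname(vapply(d, function(x) x[[1]], numeric(1)))
first_items
# [1]   1  11 111
```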

Assigning numeric values based on the letters in a string in R

I have a data.frame that is a single column with 235,886 rows. Each row corresponds to a single word of the English language.
E.g.
> words[10000:10005,1]
[1] anticontagionist anticontagious anticonventional anticonventionalism anticonvulsive
[6] anticor
What I'd like to do is convert each row to a number based on the letters in it. So, if "a" = 1, "b" = 2, "c" = 3, and "d" = 4, then "abcd" = 10. Does anyone know of a way to do that?
My ultimate goal is to have a function that scans the data.frame for a given numeric value and returns all the strings, i.e. words, with that value. So, continuing from the example above, if I asked for the value 9, this function would return "dad" and any other rows having a numeric value of 9.
You can use a combination of strsplit and match. I've thrown a tolower in there to make sure that we are matching to the right thing.
Here's a function that implements those steps:
word_value <- function(words) {
  temp <- strsplit(tolower(words), "", fixed = TRUE)
  vapply(temp, function(x) sum(match(x, letters)), integer(1L))
}
Here's a sample vector:
myvec <- c("and", "dad", "cat", "fox", "mom", "add", "dan")
Test it out:
word_value(myvec)
# [1] 19 9 24 45 41 9 19
myvec[word_value(myvec) == 9]
# [1] "dad" "add"
myvec[word_value(myvec) > 20]
# [1] "cat" "fox" "mom"
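The lookup the question asks for can be sketched as a small wrapper around word_value. The name words_with_value is hypothetical. The as.character call is there because strsplit does not accept factors, and the NA check skips words that contain characters outside a to z, for which match returns NA:

```r
word_value <- function(words) {
  temp <- strsplit(tolower(words), "", fixed = TRUE)
  vapply(temp, function(x) sum(match(x, letters)), integer(1L))
}

# Hypothetical helper: return every word whose letter sum equals `value`
words_with_value <- function(words, value) {
  words <- as.character(words)   # also accepts a factor column
  scores <- word_value(words)
  words[!is.na(scores) & scores == value]
}

myvec <- c("and", "dad", "cat", "fox", "mom", "add", "dan")
words_with_value(myvec, 9)
# [1] "dad" "add"
```

With the data frame from the question this would be called as words_with_value(words[, 1], 9).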
You can use utf8ToInt.
#using the sample data from Ananda's answer
offset <- utf8ToInt("a") - 1
d <- vapply(tolower(myvec),
function(ii) sum(utf8ToInt(ii) - offset), FUN.VALUE = double(1L))
#and dad cat fox mom add dan
# 19 9 24 45 41 9 19
d[d > 20]
#cat fox mom
# 24 45 41
Using the offset is necessary because utf8ToInt("a") is 97, but you want "a" to be 1.
Wrapping with stack will give a different format for the output, if preferred:
d <- stack(vapply(tolower(myvec),
function(ii) sum(utf8ToInt(ii) - offset), FUN.VALUE = double(1L)))
# values ind
#1 19 and
#2 9 dad
#3 24 cat
#4 45 fox
#5 41 mom
#6 9 add
#7 19 dan
d[d$values > 20,]
# values ind
#3 24 cat
#4 45 fox
#5 41 mom

Apply function to dataframe with changing argument

I have 2 objects:
A data frame with 3 variables:
v1 <- 1:10
v2 <- 11:20
v3 <- 21:30
df <- data.frame(v1,v2,v3)
A numeric vector with 3 elements:
nv <- c(6,11,28)
I would like to compare the first variable to the first number, the second variable to the second number and so on.
which(df$v1 > nv[1])
which(df$v2 > nv[2])
which(df$v3 > nv[3])
Of course in reality my data frame has a lot more variables so manually typing each variable is not an option.
I encounter these kinds of problems quite frequently. What kind of documentation would I need to read to be fluent in these matters?
One option is to compare df against a vector of the same size. To do this, replicate each element of 'nv' by the number of rows of 'df' (rep(nv, each=nrow(df))) and compare the result with df, or index 'nv' with the col function, which produces the same values as rep.
which(df > nv[col(df)], arr.ind=TRUE)
If you need a logical matrix that compares each column with the corresponding element of 'nv':
sweep(df, 2, nv, FUN='>')
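As a quick check of the claim that rep and col give the same values, the following sketch rebuilds the data from the question and compares the two expanded vectors. The variable names by_rep, by_col and hits are introduced here for illustration only:

```r
df <- data.frame(v1 = 1:10, v2 = 11:20, v3 = 21:30)
nv <- c(6, 11, 28)

by_rep <- rep(nv, each = nrow(df))  # 6 ten times, then 11, then 28
by_col <- nv[col(df)]               # col(df) holds each cell's column index
identical(by_rep, by_col)
# [1] TRUE

# Row/column positions of every value above its column's threshold
hits <- which(df > by_rep, arr.ind = TRUE)
table(hits[, "col"])
#
# 1 2 3
# 4 9 2
```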
You could also use mapply:
mapply(FUN=function(x, y)which(x > y), x=df, y=nv)
#$v1
#[1] 7 8 9 10
#
#$v2
#[1] 2 3 4 5 6 7 8 9 10
#
#$v3
#[1] 9 10
I think these sorts of situations are tricky because normal looping solutions (e.g. the apply function) only loop through one object, but you need to loop both through df and nv simultaneously. One approach is to loop through the indices and to use them to grab the appropriate information from both df and nv. A convenient way to loop through indices is the sapply function:
sapply(seq_along(nv), function(x) which(df[,x] > nv[x]))
# [[1]]
# [1] 7 8 9 10
#
# [[2]]
# [1] 2 3 4 5 6 7 8 9 10
#
# [[3]]
# [1] 9 10

Resources