R - how to select elements from sublists of a list by their name

I have a list of lists that looks like this:
list(list("A[1]" = data.frame(W = 1:5),
"A[2]" = data.frame(X = 6:10),
B = data.frame(Y = 11:15),
C = data.frame(Z = 16:20)),
list("A[1]" = data.frame(W = 21:25),
"A[2]" = data.frame(X = 26:30),
B = data.frame(Y = 31:35),
C = data.frame(Z = 36:40)),
list("A[1]" = data.frame(W = 41:45),
"A[2]" = data.frame(X = 46:50),
B = data.frame(Y = 51:55),
C = data.frame(Z = 56:60))) -> dflist
I need my output to also be a list of lists of length 3, where each sublist retains the elements whose names start with A[ and drops the other elements.
Based on some previous questions, I am trying to use this:
dflist %>%
map(keep, names(.) %in% "A[")
but that gives the following error:
Error in probe(.x, .p, ...) : length(.p) == length(.x) is not TRUE
Trying to select a single element, for example just A[1] like this:
dflist %>%
map(keep, names(.) %in% "A[1]")
also doesn't work. How can I achieve the desired output?

I think you want:
purrr::map(dflist, ~.[stringr::str_starts(names(.), "A\\[")])
What this does is:
For each sublist (purrr::map)
Select all elements of that sublist (.[], where . is the sublist)
Whose names start with A[ (stringr::str_starts(names(.), "A\\["))
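You got the top-level map correct, since you want to modify the sublists. However, map(keep, names(.) %in% "A[") has some issues:
names(.) %in% "A[" is evaluated once, on the outer list (which has no names), rather than once per sublist. To be evaluated per sublist it would have to be a function or a formula (starting with ~).
%in% tests for exact equality, so it would not match names that merely start with A[.
When given a predicate function, purrr::keep applies it to each element of the sublist, namely to the data frames themselves, so the predicate never "sees" the name of each data frame. keep only helps here if you pass it a logical vector with the same length as the sublist, which is the length check the error message reports.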
Anyway this produces:
[[1]]
[[1]]$`A[1]`
W
1 1
2 2
3 3
4 4
5 5
[[1]]$`A[2]`
X
1 6
2 7
3 8
4 9
5 10
[[2]]
[[2]]$`A[1]`
W
1 21
2 22
3 23
4 24
5 25
[[2]]$`A[2]`
X
1 26
2 27
3 28
4 29
5 30
[[3]]
[[3]]$`A[1]`
W
1 41
2 42
3 43
4 44
5 45
[[3]]$`A[2]`
X
1 46
2 47
3 48
4 49
5 50
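To make the difference between an exact match and a prefix match concrete, here is a minimal base R sketch. The list small is a hypothetical, shortened stand-in for dflist, and startsWith() is used in place of the regex-based calls above; it is only one of several ways to express the prefix test.

```r
# Hypothetical, shortened stand-in for the question's dflist.
small <- list(
  list("A[1]" = 1:2, "A[2]" = 3:4, B = 5:6),
  list("A[1]" = 7:8, B = 9:10, C = 11:12)
)

nms <- names(small[[1]])

# %in% tests exact equality, so a prefix such as "A[" never matches.
exact <- nms %in% "A["

# startsWith() is a literal prefix test, so "[" needs no escaping.
prefix <- startsWith(nms, "A[")

# Apply the prefix test per sublist and subset each sublist by it.
kept <- lapply(small, function(x) x[startsWith(names(x), "A[")])
str(kept)
```

Because startsWith() compares literal strings, there is no regular expression to get wrong; the trade-off is that it cannot express anything more general than a fixed prefix.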

If we want to use keep, use
library(dplyr)
library(purrr)
library(stringr)
map(dflist, ~ keep(.x, str_detect(names(.x), fixed("A["))))

Here is a base R solution (the ^ anchors the pattern so that only names starting with A[ match):
lapply(dflist, function(x) x[grep("^A\\[", names(x))])
[[1]]
[[1]]$`A[1]`
W
1 1
2 2
3 3
4 4
5 5
[[1]]$`A[2]`
X
1 6
2 7
3 8
4 9
5 10
[[2]]
[[2]]$`A[1]`
W
1 21
2 22
3 23
4 24
5 25
[[2]]$`A[2]`
X
1 26
2 27
3 28
4 29
5 30
[[3]]
[[3]]$`A[1]`
W
1 41
2 42
3 43
4 44
5 45
[[3]]$`A[2]`
X
1 46
2 47
3 48
4 49
5 50


Using a For Loop to multiply variables by numbers in a matrix

This should be relatively simple but I am new to R and cannot quite figure this out.
I will illustrate what I am trying to do.
I have the following:
names <- c("A","B","C")
values <- c(3,6,9)
values2 <- c(5,10,15)
y <- c("2019")
r <- c("1")
t <- c("Team A", "Team B", "Team C")
mgn <- c(33, 56, 63)
df1 <- data.frame(names,y,r,t,values,values2,mgn)
I also have a matrix:
numbers <- matrix(1:6, nrow = 3, ncol = 2)
I am trying to loop through each of the values and values2 in my df1 and multiply these by the values in my numbers matrix like so:
3 x 1 = 3
5 x 4 = 20
6 x 2 = 12
10 x 5 = 50
9 x 3 = 27
15 x 6 = 90
I would then like to print each of these values like:
values values2
[1] 3 20
[2] 12 50
[3] 27 90
I tried the following (just for the first values col):
for(col in 1:ncol(numbers)){
df1$values %*% numbers[col]
print(df1$values)
}
But this is the output I get:
[1] 3 6 9
[1] 6 12 18
[1] 6 12 18
[1] 12 24 36
[1] 12 24 36
[1] 24 48 72
I then would like to repeat the process, so that the next row of values and values2 is multiplied by the first row again in numbers (2 and 5) so that:
3 x 2 = 6
5 x 5 = 25
and so on, until all the combinations are calculated.
This would give me the output like so:
3 x 1 = 3
5 x 4 = 20
6 x 1 = 6
10 x 4 = 40
9 x 1 = 9
15 x 4 = 60
Then it should go to the next line of values and values2 and repeat:
3 x 2 = 6
5 x 5 = 25
6 x 2 = 12
10 x 5 = 50
9 x 2 = 18
15 x 5 = 75
And finally the last line:
3 x 3 = 9
5 x 6 = 30
6 x 3 = 18
10 x 6 = 60
9 x 3 = 27
15 x 6 = 90
Finally, I would like to loop through each of these, add them together like:
sumvalues = values + values2
create a total column like:
df1 %>% group_by(y, r, t) %>% dplyr::mutate(total = sum(sumvalues))
then obtain the pearson correlation for each by:
cor(mgn, sumvalues, method = "pearson")
So I can have the output like so:
sumvalues total mgn pearson
[1]
[2]
[3]
Here's how I did it:
#### select the two value columns so that both objects have the same dimensions:
df2 <- df1[, c("values", "values2")]
#### multiply element-wise and store the result in a new object:
new <- df2 * numbers
#### if you want to put the first column of df1 back:
df3 <- cbind(df1[1], new)
print(df3)
Your first output can be reached by multiplying the two value columns by the matrix:
df1[c("values", "values2")] * numbers
# values values2
#1 3 20
#2 12 50
#3 27 90
To get all possible combinations you can use apply with sweep, which multiplies the value columns by each row of numbers in turn:
apply(numbers, 1, function(x) sweep(df1[c("values", "values2")], 2, x, `*`))
#[[1]]
# values values2
#1 3 20
#2 6 40
#3 9 60
#[[2]]
# values values2
#1 6 25
#2 12 50
#3 18 75
#[[3]]
# values values2
#1 9 30
#2 18 60
#3 27 90
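The last part of the question (sumvalues, total, and the Pearson correlation) is not covered by the answers above. A rough base R sketch follows, under two assumptions that the question leaves open: total is taken to be the sum of sumvalues over all three rows, and one correlation is computed per row of numbers. The name res is made up for illustration.

```r
values  <- c(3, 6, 9)
values2 <- c(5, 10, 15)
mgn     <- c(33, 56, 63)
numbers <- matrix(1:6, nrow = 3, ncol = 2)

# One data frame per row of `numbers`.
res <- lapply(seq_len(nrow(numbers)), function(i) {
  # Multiply each column by the matching entry of row i, then add them.
  sumvalues <- values * numbers[i, 1] + values2 * numbers[i, 2]
  data.frame(
    sumvalues = sumvalues,
    total     = sum(sumvalues),   # assumption: total over all rows
    mgn       = mgn,
    pearson   = cor(mgn, sumvalues, method = "pearson")
  )
})

res[[1]]
```

With this particular sample data values2 is a constant multiple of values, so every sumvalues vector is proportional to values and the correlation comes out the same for each row of numbers; real data would normally give different values.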

R - replace all values smaller than a specific value in a column with the nearest bigger value

I have a data frame like this one:
df <- data.frame(c(1,2,3,4,5,6,7), c(0,23,55,0,1,40,21))
names(df) <- c("a", "b")
a b
1 0
2 23
3 55
4 0
5 1
6 40
7 21
Now I want to replace all values smaller than 22 in column b with the nearest bigger value. Of course it is possible to use loops, but since I have quite big datasets this is way too slow.
The solution should look somewhat like this:
a b
1 23
2 23
3 55
4 55
5 40
6 40
7 40
Here is a tidyverse possibility (but note @phiver's comment on replacement ambiguities)
library(tidyverse)
df %>%
mutate(b = ifelse(b < 22, NA, b)) %>%
fill(b) %>%
fill(b, .direction = "up")
# a b
#1 1 23
#2 2 23
#3 3 55
#4 4 55
#5 5 55
#6 6 40
#7 7 40
Explanation: Replace values b < 22 with NA and then use fill to fill NAs with previous/following non-NA entries.
Sample data
df <- data.frame(a = c(1,2,3,4,5,6,7), b = c(0,23,55,0,1,40,21))
You can use zoo::rollapply with a window of width 3:
library(zoo)
df$b <- rollapply(df$b, 3, function(x)
if (x[2] < 22) min(x[x >= 22]) else x[2],
partial = TRUE)
# df
# a b
# 1 1 23
# 2 2 23
# 3 3 55
# 4 4 55
# 5 5 40
# 6 6 40
# 7 7 40
In base R you could do this for the same output:
transform(df, b = sapply(seq_along(b),function(i)
if (b[i] < 22) {
bi <- c(b,Inf)[seq(i-1,i+1)]
min(bi[bi>=22])
} else b[i]))
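Since the question asks for something faster than a loop on large data, here is a fully vectorised base R sketch built on findInterval(). It assumes that "nearest" means nearest by row position and that ties go to the earlier row; both are assumptions, because the question leaves the rule ambiguous. The function name nearest_kept is hypothetical.

```r
nearest_kept <- function(b, threshold) {
  ok <- which(b >= threshold)          # positions whose value is kept
  if (length(ok) == 0L) return(rep(NA_real_, length(b)))
  i <- seq_along(b)
  k <- findInterval(i, ok)             # how many kept positions are <= i
  prev_pos <- ok[ifelse(k == 0L, NA_integer_, k)]  # nearest kept position at or before i
  next_pos <- ok[k + 1L]               # next kept position, NA past the last one
  d_prev <- i - prev_pos
  d_next <- next_pos - i
  # Use the following position only if there is no previous one,
  # or if it is strictly closer (ties go to the earlier row).
  use_next <- is.na(d_prev) | (!is.na(d_next) & d_next < d_prev)
  b[ifelse(use_next, next_pos, prev_pos)]
}

b <- c(0, 23, 55, 0, 1, 40, 21)
out <- nearest_kept(b, 22)
print(out)
```

Unlike the width-3 window approaches, this looks as far as necessary in either direction, so longer runs of small values are also filled.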

Subsetting a column in one data frame using two columns in another data frame in R

I have searched SO for a similar problem but couldn't find one.
I have two data frames. I want to subset one column in one data frame using two columns in another data frame.
The data frames are as follows.
df1 <- data.frame(x = c(22,23,22,34,21),
y = c(1,4,2,3,2))
df1
x y
1 22 1
2 23 4
3 22 2
4 34 3
5 21 2
df2 <- data.frame(a = c("John", "Matt", "foo","boo"),
b = c(4, NA, NA,2),
c = c(3, NA, 3, 3))
df2
a b c
1 John 4 3
2 Matt NA NA
3 foo NA 3
4 boo 2 3
I want to subset df1 on column df1$y using the values in columns b and c of data frame df2, with a vectorized operation.
The output should be a list with one element per row of df2, as follows:
df1
df1[1]
x y
2 23 4
4 34 3
df1[2]
df1[3]
x y
4 34 3
df1[4]
x y
3 22 2
4 34 3
5 21 2
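You can try something like this:
dfnew <- vector("list", nrow(df2))
for (i in seq_len(nrow(df2))) {
  # unlist() turns the one-row slice of columns b and c into a plain vector for %in%
  dfnew[[i]] <- df1[df1$y %in% unlist(df2[i, 2:3]), ]
}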
Result:
dfnew
[[1]]
x y
2 23 4
4 34 3
[[2]]
[1] x y
<0 rows> (or 0-length row.names)
[[3]]
x y
4 34 3
[[4]]
x y
3 22 2
4 34 3
5 21 2
We can use lapply
lapply(split(df2[-1], as.character(df2$a)), function(x) df1[df1$y %in% unlist(x),])
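One thing to watch with the split() version is that the result is ordered by the sorted values of a, not by the row order of df2. If the original row order matters, a sketch along the following lines may be closer to the requested output; it uses Map() from base R, and the name res is made up for illustration.

```r
df1 <- data.frame(x = c(22, 23, 22, 34, 21),
                  y = c(1, 4, 2, 3, 2))
df2 <- data.frame(a = c("John", "Matt", "foo", "boo"),
                  b = c(4, NA, NA, 2),
                  c = c(3, NA, 3, 3))

# Walk over columns b and c in parallel, one pair of values per row of df2.
res <- Map(function(u, v) df1[df1$y %in% c(u, v), ], df2$b, df2$c)

# Label each element with the matching name, keeping df2's row order.
names(res) <- df2$a
res
```

Rows of df2 that contain only NA (Matt here) give a zero-row data frame, because no value of df1$y equals NA.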

automating a normal transformation function in R over multiple columns

I have a data frame m with:
>m
id w y z
1 2 5 8
2 18 5 98
3 1 25 5
4 52 25 8
5 5 5 4
6 3 3 5
Below is a general function for normally transforming a variable that I need to apply to columns w,y,z.
y <- qnorm((rank(x, na.last = "keep") - 0.5) / sum(!is.na(x)))
For example, if I wanted to run this function on "column w" to get the output column appended to dataframe "m" then:
m$w_n <- qnorm((rank(m$w, na.last = "keep") - 0.5) / sum(!is.na(m$w)))
Can someone help me automate this to run on multiple columns in data frame m?
Ideally, I would want an output data frame with the following columns:
id w y z w_n y_n z_n
Note this is a sample data frame, the one I have is much larger and I have more letter columns to run this function on other than w, y,z.
Thanks!
There is probably a way to do it in a single step, but what about:
df <- data.frame(id = 1:6, w = sample(50, 6), z = sample(50, 6) )
df
id w z
1 1 39 40
2 2 20 26
3 3 43 11
4 4 4 37
5 5 36 24
6 6 27 14
transCols <- function(x) qnorm((rank(x,na.last="keep")-0.5)/sum(!is.na(x)))
tmpdf <- lapply(df[, -1], transCols)
names(tmpdf) <- paste0(names(tmpdf), "_n")
df_final <- cbind(df, tmpdf)
df_final
id w z w_n z_n
1 1 39 40 0.6744898 1.3829941
2 2 20 26 -0.6744898 0.2104284
3 3 43 11 1.3829941 -1.3829941
4 4 4 37 -1.3829941 0.6744898
5 5 36 24 0.2104284 -0.2104284
6 6 27 14 -0.2104284 -0.6744898
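The same idea can be written as a single assignment on the data frame from the question. This is only a sketch: the helper name rank_normal is made up, and it assumes that every column except id should be transformed.

```r
m <- data.frame(id = 1:6,
                w = c(2, 18, 1, 52, 5, 3),
                y = c(5, 5, 25, 25, 5, 3),
                z = c(8, 98, 5, 8, 4, 5))

# Rank-based normal transformation from the question, wrapped in a function.
rank_normal <- function(x) {
  qnorm((rank(x, na.last = "keep") - 0.5) / sum(!is.na(x)))
}

# Assumption: transform every column except the id column.
cols <- setdiff(names(m), "id")

# Create all the *_n columns in one assignment.
m[paste0(cols, "_n")] <- lapply(m[cols], rank_normal)
names(m)
```

Selecting the columns by name rather than by position keeps this working when more letter columns are added to m.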

Remove rows based on factor-levels

I have a data.frame df in format "long".
df <- data.frame(site = rep(c("A","B","C"), 1, 7),
time = c(11,11,11,22,22,22,33),
value = ceiling(rnorm(7)*10))
df <- df[order(df$site), ]
df
site time value
1 A 11 12
2 A 22 -24
3 A 33 -30
4 B 11 3
5 B 22 16
6 C 11 3
7 C 22 9
Question
How do I remove the rows whose value of df$time is not present for every level of df$site?
In this case I want to remove df[3,], because for df$time the timestamp 33 is only present for site A and not for site B and site C.
Desired output:
df.trimmed
site time value
1 A 11 12
2 A 22 -24
4 B 11 3
5 B 22 16
6 C 11 3
7 C 22 9
The data.frame has easily 800k rows and 200k unique timestamps. I don't want to use loops but I don't know how to use vectorized functions like apply() or lapply() for this case.
Here's another possible solution using the data.table package:
unTime <- unique(df$time)
library(data.table)
DT <- data.table(df, key = "site")
(notInAll <- unique(DT[, list(ans = which(!unTime %in% time)), by = key(DT)]$ans))
# [1] 3
DT[time %in% unTime[-notInAll]]
# site time value
# [1,] A 11 3
# [2,] A 22 11
# [3,] B 11 -6
# [4,] B 22 -2
# [5,] C 11 -19
# [6,] C 22 -14
EDIT from Matthew
Nice. Or a slightly more direct way :
DT = as.data.table(df)
tt = DT[,length(unique(site)),by=time]
tt
time V1
1: 11 3
2: 22 3
3: 33 1
tt = tt[V1==max(V1)] # See * below
tt
time V1
1: 11 3
2: 22 3
DT[time %in% tt$time]
site time value
1: A 11 7
2: A 22 -2
3: B 11 8
4: B 22 -10
5: C 11 3
6: C 22 1
In case no time is present in all sites, in which case the final result should be empty (as Ben pointed out in the comments), the step marked * above could be:
tt = tt[V1==length(unique(DT$site))]
Would rle work for you?
df <- df[order(df$time), ]
# timestamps that occur only once (run length 1) after sorting by time
once <- with(rle(df$time), values[lengths == 1])
df <- subset(df, !(time %in% once))
df <- df[order(df$site), ]
df
## site time value
## 1 A 11 17
## 4 A 22 -3
## 2 B 11 8
## 5 B 22 5
## 3 C 11 0
## 6 C 22 13
Re-looking at your data, it seems that this solution might be too simple for your needs though....
Update
Here's an approach that should be better than the rle solution that I put above. Rather than looking for a run length of 1, it deletes the rows whose timestamp does not meet a condition on the results of table(df$site, df$time). To illustrate, I've also added some more fake data.
df <- data.frame(site = rep(c("A","B","C"), 1, 7),
time = c(11,11,11,22,22,22,33),
value = ceiling(rnorm(7)*10))
df2 <- data.frame(site = rep(c("A","B","C"), 1, 7),
time = c(14,14,15,15,16,16,16),
value = ceiling(rnorm(7)*10))
df <- rbind(df, df2)
df <- df[order(df$site), ]
# factor() is needed because site is a character column by default in R >= 4.0
temp <- as.numeric(names(which(colSums(with(df, table(site, time)))
>= length(levels(factor(df$site))))))
df2 <- merge(df, data.frame(temp), by.x = "time", by.y = "temp")
df2 <- df2[order(df2$site), ]
df2
## time site value
## 3 11 A -2
## 4 16 A -2
## 7 22 A 2
## 1 11 B -16
## 5 16 B 3
## 8 22 B -6
## 2 11 C 8
## 6 16 C 11
## 9 22 C -10
Here's the result of tabulating and summing up the site/time combination:
colSums(with(df, table(site, time)))
## 11 14 15 16 22 33
## 3 2 2 3 3 1
Thus, if we were interested in keeping the timestamps that at least two sites share, we could take the >= length(levels(...)) threshold (in this example, 3) and subtract 1 from it (giving 2).
Not sure if this solution is useful to you at all, but I thought I would share it to show the flexibility in solutions we have with R.
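For comparison, the counting idea behind the answers above (count the distinct sites per timestamp and keep the timestamps seen at every site) can also be sketched in base R with ave(), which returns one count per row and so avoids a merge. The value column below holds placeholder numbers rather than the random data from the question, and the names n_sites and n_here are made up.

```r
df <- data.frame(site  = c("A", "B", "C", "A", "B", "C", "A"),
                 time  = c(11, 11, 11, 22, 22, 22, 33),
                 value = 1:7)   # placeholder values, not the question's rnorm() data

# Total number of distinct sites in the data.
n_sites <- length(unique(df$site))

# For every row, the number of distinct sites sharing that row's timestamp.
n_here <- ave(as.integer(factor(df$site)), df$time,
              FUN = function(s) length(unique(s)))

# Keep only rows whose timestamp occurs at every site.
df.trimmed <- df[n_here == n_sites, ]
df.trimmed <- df.trimmed[order(df.trimmed$site), ]
df.trimmed
```

How well this scales to 800k rows and 200k timestamps has not been measured here; the grouped data.table versions are likely to be faster at that size.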
