How to change the way split returns values in R?

I'm working on a project and I want to take a matrix, split it by the values w and x, and then for each of those splits find the maximum value of y.
Here's an example matrix
>rah = cbind(w = 1:6, x = 1:3, y = 12:1, z = 1:12)
>rah
w x y z
[1,] 1 1 12 1
[2,] 2 2 11 2
[3,] 3 3 10 3
[4,] 4 1 9 4
[5,] 5 2 8 5
[6,] 6 3 7 6
[7,] 1 1 6 7
[8,] 2 2 5 8
[9,] 3 3 4 9
[10,] 4 1 3 10
[11,] 5 2 2 11
[12,] 6 3 1 12
So I run split
> doh = split(rah, list(rah[,1], rah[,2]))
> doh
$`1.1`
[1] 1 1 1 1 12 6 1 7
$`2.1`
integer(0)
$`3.1`
integer(0)
$`4.1`
[1] 4 4 1 1 9 3 4 10
$`5.1`
integer(0)
$`6.1`
integer(0)
$`1.2`
integer(0)
$`2.2`
[1] 2 2 2 2 11 5 2 8
$`3.2`
integer(0)
$`4.2`
integer(0)
$`5.2`
[1] 5 5 2 2 8 2 5 11
...
So I'm a bit confused about how to take the output of split, group the rows that share the same combination of w and x values (such as row 1 and row 7), and then compare them to find the one with the highest y value.
EDIT: Informative answers so far, but I just realized that I forgot to mention one very important part: I want to keep the whole row (w, x, y, z).

Use aggregate instead
> aggregate(y ~ w + x, max, data=rah)
w x y
1 1 1 12
2 4 1 9
3 2 2 11
4 5 2 8
5 3 3 10
6 6 3 7
If you want to use split, try
> split_rah <- split(rah[,"y"], list(rah[, "w"], rah[, "x"]))
> ind <- sapply(split_rah, function(x) length(x)>0)
> sapply(split_rah[ind], max)
1.1 4.1 2.2 5.2 3.3 6.3
12 9 11 8 10 7
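As a small variation on the split approach above, split accepts a drop argument that discards the empty combinations up front, so the length filter is not needed. This is a minimal sketch on the question's matrix; the names groups and maxima are only illustrative.

```r
rah <- cbind(w = 1:6, x = 1:3, y = 12:1, z = 1:12)

# drop = TRUE removes (w, x) combinations that never occur in the data
groups <- split(rah[, "y"], list(rah[, "w"], rah[, "x"]), drop = TRUE)

# One maximum per remaining group, named "w.x"
maxima <- sapply(groups, max)
maxima
```

Only the six combinations that actually occur remain, which matches the groups kept by the length check.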
Just for the record, summaryBy from the doBy package works in the same fashion as aggregate:
> library(doBy)
> summaryBy(y ~ w + x, FUN=max, data=as.data.frame(rah))
w x y.max
1 1 1 12
2 2 2 11
3 3 3 10
4 4 1 9
5 5 2 8
6 6 3 7
data.table solution:
> library(data.table)
> dt <- data.table(rah)
> dt[, max(y), by=list(w, x)]
w x V1
1: 1 1 12
2: 2 2 11
3: 3 3 10
4: 4 1 9
5: 5 2 8
6: 6 3 7

> tapply(rah[,"y"], list( rah[,"w"], rah[,"x"]), max)
1 2 3
1 12 NA NA
2 NA 11 NA
3 NA NA 10
4 9 NA NA
5 NA 8 NA
6 NA NA 7

Another option, using the plyr package:
> library(plyr)
> ddply(as.data.frame(rah), .(w, x), summarize, z = max(y))
w x z
1 1 1 12
2 2 2 11
3 3 3 10
4 4 1 9
5 5 2 8
6 6 3 7
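The summaries above return only w, x and the maximum of y, while the question's EDIT asks to keep the whole row. One possible base R sketch, assuming ties in y can be broken arbitrarily, is to sort so that the largest y comes first within each (w, x) pair and then keep the first row of each pair. The names ord, sorted and best are illustrative.

```r
rah <- cbind(w = 1:6, x = 1:3, y = 12:1, z = 1:12)

# Sort by w, then x, then y in decreasing order
ord <- order(rah[, "w"], rah[, "x"], -rah[, "y"])
sorted <- rah[ord, , drop = FALSE]

# duplicated() on a matrix compares whole rows, so restrict it to the
# key columns; the first row of each (w, x) pair has the largest y
best <- sorted[!duplicated(sorted[, c("w", "x")]), , drop = FALSE]
best
```

Each remaining row still carries its z value, so nothing from the original row is lost.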

Related

How do I select rows in a data frame before and after a condition is met?

I've been searching the web for a few days now and I can't find a solution to my (probably easy to solve) problem.
I have huge data frames with 4 variables and over a million observations each. Now I want to select the 100 rows before, all rows during, and the 1000 rows after a specific condition is met, and fill the rest with NAs. I tried it with a for loop and if/ifelse, but it hasn't worked so far. I think it shouldn't be a big thing, but at the moment I just can't get the hang of it.
I create the data using:
foo<-data.frame(t = 1:15, a = sample(1:15), b = c(1,1,1,1,1,4,4,4,4,1,1,1,1,1,1), c = sample(1:15))
My Data looks like this:
ID t a b c
1 1 4 1 7
2 2 7 1 10
3 3 10 1 6
4 4 2 1 4
5 5 13 1 9
6 6 15 4 3
7 7 8 4 15
8 8 3 4 1
9 9 9 4 2
10 10 14 1 8
11 11 5 1 11
12 12 11 1 13
13 13 12 1 5
14 14 6 1 14
15 15 1 1 12
What I want is to pick the value of a (in this example) for the 2 rows before, all rows during, and the 3 rows after the stretch where b > 1, and fill the rest with NAs. (This is just an example; in the real data there are more rows after these 15, with b changing from 1 to 4 several times. I left them out to avoid cluttering the question with unnecessary data.)
So I want to get something like:
ID t a b c d
1 1 4 1 7 NA
2 2 7 1 10 NA
3 3 10 1 6 NA
4 4 2 1 4 2
5 5 13 1 9 13
6 6 15 4 3 15
7 7 8 4 15 8
8 8 3 4 1 3
9 9 9 4 2 9
10 10 14 1 8 14
11 11 5 1 11 5
12 12 11 1 13 11
13 13 12 1 5 NA
14 14 6 1 14 NA
15 15 1 1 12 NA
I'm thankful for any help.
Thank you.
Best regards,
Chris
Here is the same approach as missuse's, but with data.table:
library(data.table)
foo<-data.frame(t = 1:11, a = sample(1:11), b = c(1,1,1,4,4,4,4,1,1,1,1), c = sample(1:11))
DT <- setDT(foo)
DT[ unique(c(DT[,.I[b>1] ],DT[,.I[b>1]+3 ],DT[,.I[b>1]-2 ])), d := a]
t a b c d
1: 1 10 1 2 NA
2: 2 6 1 10 6
3: 3 5 1 7 5
4: 4 11 4 4 11
5: 5 4 4 9 4
6: 6 8 4 5 8
7: 7 2 4 8 2
8: 8 3 1 3 3
9: 9 7 1 6 7
10: 10 9 1 1 9
11: 11 1 1 11 NA
Here
unique(c(DT[,.I[b>1] ],DT[,.I[b>1]+3 ],DT[,.I[b>1]-2 ]))
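gives you the desired indices: the unique row indices where the condition holds, plus the same indices shifted by +3 and by -2. Note that the shifted copies only fill the whole window when each run of b > 1 is at least 3 rows long, as it is in this example.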
Here is an attempt.
Get the indexes that satisfy the condition b > 1:
z <- which(foo$b > 1)
Get the indexes for the window (z - 2) : (z + 3):
ind <- unique(unlist(lapply(z, function(x){
  g <- pmax(x - 2, 1) # clamp the start to 1 if x - 2 is not positive
  g : (x + 3)
})))
Create a d column filled with NA:
foo$d <- NA
Replace the elements at those indexes with the values of foo$a:
foo$d[ind] <- foo$a[ind]
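The steps above can be wrapped in a small helper that also clamps the end of each window to the last row, so a hit near the bottom of the data does not produce out-of-range indexes. This is only a sketch: window_index and its arguments are hypothetical names, and the values of a are copied from the question's printed data rather than sampled.

```r
# Hypothetical helper: row indexes within `before` rows ahead of and
# `after` rows behind any TRUE in `hit`, clamped to 1..length(hit)
window_index <- function(hit, before, after) {
  n <- length(hit)
  pos <- which(hit)
  if (length(pos) == 0L) return(integer(0))
  idx <- unlist(lapply(pos, function(p) max(p - before, 1):min(p + after, n)))
  sort(unique(idx))
}

foo <- data.frame(t = 1:15,
                  a = c(4, 7, 10, 2, 13, 15, 8, 3, 9, 14, 5, 11, 12, 6, 1),
                  b = c(1, 1, 1, 1, 1, 4, 4, 4, 4, 1, 1, 1, 1, 1, 1))

keep <- window_index(foo$b > 1, before = 2, after = 3)
foo$d <- NA_real_             # start with all NA
foo$d[keep] <- foo$a[keep]    # copy a only inside the windows
foo
```

For the real data the same call would use before = 100 and after = 1000; with millions of hits the lapply may be slow, so treat this as a starting point rather than a tuned solution.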
library(dplyr)
library(purrr)
# example dataset
foo<-data.frame(t = 1:15,
a = sample(1:15),
b = c(1,1,1,1,1,4,4,4,4,1,1,1,1,1,1),
c = sample(1:15))
# function to get indices of interest
# for a given index x go 2 positions back and 3 forward
# keep only positive indices
GetIDsBeforeAfter = function(x) {
v = (x-2) : (x+3)
v[v > 0]
}
foo %>% # from your dataset
filter(b > 1) %>% # keep rows where b > 1
pull(t) %>% # get the positions
map(GetIDsBeforeAfter) %>% # for each position apply the function
unlist() %>% # unlist all sets indices
unique() -> ids_to_remain # keep unique ones and save them in a vector
foo$d = foo$a # copy column a as d
foo$d[-ids_to_remain] = NA # put NA to all positions not in our vector
foo
# t a b c d
# 1 1 5 1 8 NA
# 2 2 6 1 14 NA
# 3 3 4 1 10 NA
# 4 4 1 1 7 1
# 5 5 10 1 5 10
# 6 6 8 4 9 8
# 7 7 9 4 15 9
# 8 8 3 4 6 3
# 9 9 7 4 2 7
# 10 10 12 1 3 12
# 11 11 11 1 1 11
# 12 12 15 1 4 15
# 13 13 14 1 11 NA
# 14 14 13 1 13 NA
# 15 15 2 1 12 NA

Generating Permutations of Values Within Multiple Lists [duplicate]

This question already has an answer here:
All possible combinations of elements from different bins (one element from every bin) [duplicate]
(1 answer)
Closed 6 years ago.
I'm trying to generate permutations by taking 1 value from 3 different lists
l <- list(A=c(1:13), B=c(1:5), C=c(1:3))
Desired result => Matrix of all the permutations where the first value can be 1-13, second value can be 1-5, third value can be 1-3
I tried using permn from the combinat package, but it seems to just rearrange the 3 lists.
> permn(l)
[[1]]
[[1]]$A
[1] 1 2 3 4 5 6 7 8 9 10 11 12 13
[[1]]$B
[1] 1 2 3 4 5
[[1]]$C
[1] 1 2 3
[[2]]
[[2]]$A
[1] 1 2 3 4 5 6 7 8 9 10 11 12 13
[[2]]$C
[1] 1 2 3
[[2]]$B
[1] 1 2 3 4 5
....
Expected output
[,1] [,2] [,3]
[1,] 1 1 3
[2,] 1 2 1
[3,] 1 1 2
[4,] 1 1 3
and so on...
We can use expand.grid. It can directly be applied on the list
expand.grid(l)
You can create a data frame using do.call and expand.grid. If you really need a matrix, use as.matrix on the result:
> l <- list(A=c(1:13), B=c(1:5), C=c(1:3))
> out <- do.call(expand.grid, l)
> head(out)
A B C
1 1 1 1
2 2 1 1
3 3 1 1
4 4 1 1
5 5 1 1
6 6 1 1
> tail(out)
A B C
190 8 5 3
191 9 5 3
192 10 5 3
193 11 5 3
194 12 5 3
195 13 5 3
> tail(as.matrix(out))
A B C
[190,] 8 5 3
[191,] 9 5 3
[192,] 10 5 3
[193,] 11 5 3
[194,] 12 5 3
[195,] 13 5 3
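As a quick sanity check on the expand.grid result, the number of rows should equal the product of the list lengths (13 * 5 * 3 = 195), and the first element of the list varies fastest. A minimal sketch, with out and m as illustrative names:

```r
l <- list(A = 1:13, B = 1:5, C = 1:3)

# One row per combination, taking one value from each list element
out <- expand.grid(l)

# The row count matches the product of the element lengths
nrow(out) == prod(lengths(l))

# Convert to a matrix only if a matrix is really required
m <- as.matrix(out)
dim(m)
```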

How to replace the NA values after merge two data.frame? [duplicate]

This question already has answers here:
Replacing NAs with latest non-NA value
(21 answers)
Closed 7 years ago.
I have two data frames as follows:
> a <- data.frame(x=c(1,2,3,4,5,6,7,8), y=c(1,3,5,7,9,11,13,15))
> a
x y
1 1 1
2 2 3
3 3 5
4 4 7
5 5 9
6 6 11
7 7 13
8 8 15
> b <- data.frame(x=c(1,5,7), z=c(2, 4, 6))
> b
x z
1 1 2
2 5 4
3 7 6
Then I use join (from the plyr package) to combine the two data frames:
> c <- join(a, b, by="x", type="left")
> c
x y z
1 1 1 2
2 2 3 NA
3 3 5 NA
4 4 7 NA
5 5 9 4
6 6 11 NA
7 7 13 6
8 8 15 NA
I need to replace each NA in the z column with the last non-NA value before it. I want the result to look like this:
> c
x y z
1 1 1 2
2 2 3 2
3 3 5 2
4 4 7 2
5 5 9 4
6 6 11 4
7 7 13 6
8 8 15 6
This time (if your data is not too large) a loop is an elegant option:
for(i in which(is.na(c$z))){
c$z[i] = c$z[i-1]
}
gives:
> c
x y z
1 1 1 2
2 2 3 2
3 3 5 2
4 4 7 2
5 5 9 4
6 6 11 4
7 7 13 6
8 8 15 6
data:
library(plyr)
a <- data.frame(x=c(1,2,3,4,5,6,7,8), y=c(1,3,5,7,9,11,13,15))
b <- data.frame(x=c(1,5,7), z=c(2, 4, 6))
c <- join(a, b, by="x", type="left")
You might also want to check na.locf in the zoo package.
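If a loop is too slow and adding a package is not an option, the same fill can be sketched in vectorised base R. The helper name locf is hypothetical, and merge is used here instead of plyr's join only to keep the example self-contained. Unlike the loop, this version leaves leading NAs as NA instead of failing on the first element.

```r
# Hypothetical helper: carry the last non-NA value forward
locf <- function(v) {
  filled <- !is.na(v)
  # cumsum(filled) counts the non-NA values seen so far, which is the
  # position of the most recent one within v[filled]; 0 means none yet
  c(NA, v[filled])[cumsum(filled) + 1L]
}

a <- data.frame(x = c(1, 2, 3, 4, 5, 6, 7, 8), y = c(1, 3, 5, 7, 9, 11, 13, 15))
b <- data.frame(x = c(1, 5, 7), z = c(2, 4, 6))

# Left join in base R; rows of a without a match in b get NA in z
joined <- merge(a, b, by = "x", all.x = TRUE)
joined$z <- locf(joined$z)
joined
```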

Remove i+1th term if recurring

Say we have the following data
A <- c(1,2,2,2,3,4,8,6,6,1,2,3,4)
B <- c(1,2,3,4,5,1,2,3,4,5,1,2,3)
data <- data.frame(A,B)
How would one write a function so that, for A, if the same value appears again in the i+1th position, the repeated row is removed?
The output should therefore look like
data.frame(c(1,2,3,4,8,6,1,2,3,4), c(1,2,5,1,2,3,5,1,2,3))
My best guess would be to use a for statement; however, I have no experience with those.
You can try
data[c(TRUE, data[-1,1]!= data[-nrow(data), 1]),]
Another option, dplyr-esque:
library(dplyr)
dat1 <- data.frame(A=c(1,2,2,2,3,4,8,6,6,1,2,3,4),
B=c(1,2,3,4,5,1,2,3,4,5,1,2,3))
dat1 %>% filter(row_number() == 1 | A != lag(A)) # always keep the first row
## A B
## 1 1 1
## 2 2 2
## 3 3 5
## 4 4 1
## 5 8 2
## 6 6 3
## 7 1 5
## 8 2 1
## 9 3 2
## 10 4 3
Using diff, which calculates the pairwise differences with a lag of 1:
data[c( TRUE, diff(data[,1]) != 0), ]
output:
A B
1 1 1
2 2 2
5 3 5
6 4 1
7 8 2
8 6 3
10 1 5
11 2 1
12 3 2
13 4 3
Using rle
A <- c(1,2,2,2,3,4,8,6,6,1,2,3,4)
B <- c(1,2,3,4,5,1,2,3,4,5,1,2,3)
data <- data.frame(A,B)
X <- rle(data$A)
Y <- cumsum(c(1, X$lengths[-length(X$lengths)]))
View(data[Y, ])
row.names A B
1 1 1 1
2 2 2 2
3 5 3 5
4 6 4 1
5 7 8 2
6 8 6 3
7 10 1 5
8 11 2 1
9 12 3 2
10 13 4 3
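Since the question asks for a function, the comparison used in the answers above can be wrapped in one. This is a sketch under the assumption that the column contains no NA values; drop_repeats and its arguments are hypothetical names.

```r
# Hypothetical helper: drop rows whose value in `col` repeats the
# value of the row immediately before it (assumes no NA in `col`)
drop_repeats <- function(df, col) {
  if (nrow(df) == 0L) return(df)
  v <- df[[col]]
  # The first row is always kept; later rows are kept only when they
  # differ from their predecessor
  keep <- c(TRUE, v[-1] != v[-length(v)])
  df[keep, , drop = FALSE]
}

A <- c(1, 2, 2, 2, 3, 4, 8, 6, 6, 1, 2, 3, 4)
B <- c(1, 2, 3, 4, 5, 1, 2, 3, 4, 5, 1, 2, 3)
data <- data.frame(A, B)

drop_repeats(data, "A")
```

Only consecutive repeats are removed, so values that reappear later (such as the second 1 in A) are kept.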

Reflecting changes in a dataframe by Modifying a list containing the dataframe

I have a list containing a couple of data frames. I'm calling lapply on the list and assigning the output back to the same list. I expected that this would change the data frames themselves, but it doesn't. Can someone help with this? I guess it should be quite straightforward, but I can't find anything that helps.
Thanks.
Sample data: (Source: Change multiple dataframes in a loop)
data_frame1 <- data.frame(a=c(1,5,3,3,2), b=c(3,6,1,5,5), c=c(4,4,1,9,2))
data_frame2 <- data.frame(a=c(6,0,9,1,2), b=c(2,7,2,2,1), c=c(8,4,1,9,2))
data_frame3 <- data.frame(a=c(0,0,1,5,1), b=c(4,1,9,2,3), c=c(2,9,7,1,1))
ll <- list(data_frame1,data_frame2,data_frame3)
ll <- lapply(ll,function(df){
df$log_a <- log(df$a) ## new column with the log a
df$tans_col <- df$a+df$b+df$c ## new column with sums of some columns or any other
df
})
Results:
ll
[[1]]
a b c log_a tans_col
1 1 3 4 0.0000000 8
2 5 6 4 1.6094379 15
3 3 1 1 1.0986123 5
4 3 5 9 1.0986123 17
5 2 5 2 0.6931472 9
[[2]]
a b c log_a tans_col
1 6 2 8 1.7917595 16
2 0 7 4 -Inf 11
3 9 2 1 2.1972246 12
4 1 2 9 0.0000000 12
5 2 1 2 0.6931472 5
[[3]]
a b c log_a tans_col
1 0 4 2 -Inf 6
2 0 1 9 -Inf 10
3 1 9 7 0.000000 17
4 5 2 1 1.609438 8
5 1 3 1 0.000000 5
data_frame1
a b c
1 1 3 4
2 5 6 4
3 3 1 1
4 3 5 9
5 2 5 2
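The behaviour shown above follows from R's copy semantics: the list holds copies of the data frames, so modifying the list elements never touches data_frame1 and friends. One possible way to push the results back, sketched here on smaller example data and assuming the list is named after the original variables, is list2env. The total column is just an illustrative stand-in for the new columns.

```r
data_frame1 <- data.frame(a = c(1, 5, 3), b = c(3, 6, 1))
data_frame2 <- data.frame(a = c(6, 2, 9), b = c(2, 7, 2))

# Name the list elements after the variables they came from
ll <- list(data_frame1 = data_frame1, data_frame2 = data_frame2)

ll <- lapply(ll, function(df) {
  df$total <- df$a + df$b   # illustrative new column
  df
})

# Assign each list element to a variable of the same name in the
# current environment (the global environment at top level)
list2env(ll, envir = environment())

data_frame1
```

Keeping the data frames in the list and working with ll directly is often simpler than writing them back; list2env is only needed when the standalone variables must reflect the changes.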
