I have the following question.
Given the data frame db below, I want to rearrange the values in each row so that the NA values end up at both ends (as in db2).
How can I do it dynamically?
Thank you
db<-data.frame(N=c(2,4,6,8),
a=c(1,1,1,1),
b=c(1,1,1,1),
c=c(NA,1,1,1),
d=c(NA,1,1,1),
e=c(NA,NA,1,1),
f=c(NA,NA,1,1),
g=c(NA,NA,NA,1),
h=c(NA,NA,NA,1))
db2<-data.frame(N=c(2,4,6,8),
a=c(NA,NA,NA,1),
b=c(NA,NA,1,1),
c=c(NA,1,1,1),
d=c(1,1,1,1),
e=c(1,1,1,1),
f=c(NA,1,1,1),
g=c(NA,NA,1,1),
h=c(NA,NA,NA,1))
N a b c d e f g h
1 2 NA NA NA 1 1 NA NA NA
2 4 NA NA 1 1 1 1 NA NA
3 6 NA 1 1 1 1 1 1 NA
4 8 1 1 1 1 1 1 1 1
If the number of NAs per row is always even, loop through the rows and rearrange each one by placing half of its NAs at the start and the other half at the end:
db[-1] <- t(apply(db[-1], 1, function(x) {
i1 <- is.na(x)
if(sum(i1) > 0) setNames(c(rep(NA,sum(i1)/2), x[!i1],
rep(NA, sum(i1)/2)), names(x)) else x}))
db
# N a b c d e f g h
#1 2 NA NA NA 1 1 NA NA NA
#2 4 NA NA 1 1 1 1 NA NA
#3 6 NA 1 1 1 1 1 1 NA
#4 8 1 1 1 1 1 1 1 1
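The answer above relies on every row having an even number of NAs. A minimal sketch for the odd case follows, assuming the extra NA should go on the right (the question does not say which side it belongs on). The helper name centre_non_na and the small data frame ex are made up for illustration.

```r
# Hypothetical helper (not from the answer above): centre the non-NA values
# of one row; when the NA count is odd, the extra NA goes on the right.
centre_non_na <- function(x) {
  n_na  <- sum(is.na(x))
  left  <- n_na %/% 2   # NAs placed before the values
  right <- n_na - left  # NAs placed after the values
  setNames(c(rep(NA, left), x[!is.na(x)], rep(NA, right)), names(x))
}

# Small made-up example; the first two rows have an odd number of NAs
ex <- data.frame(N = c(2, 4, 6),
                 a = c(1, 1, 1),
                 b = c(NA, 1, 1),
                 c = c(NA, 1, 1),
                 d = c(NA, NA, 1))
ex[-1] <- t(apply(ex[-1], 1, centre_non_na))
ex
```

For rows with an even NA count this should behave like the original code, since left and right are then equal.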
I have a data frame that looks like this;
df <- data.frame(Trip =c(rep("A",10),rep("B",10)),
State =c(0,0,0,1,1,1,0,0,1,0,0,1,1,0,0,0,1,1,1,0),
Distance = c(0,2,9,4,3,1,4,5,6,3,2,6,1,5,3,3,6,1,8,2),
DistanceToNext = c(NA,NA,NA,3,1,15,NA,NA,NA,NA,NA,1,17,NA,NA,NA,1,8,NA,NA))
Trip State Distance DistanceToNext
1 A 0 0 NA
2 A 0 2 NA
3 A 0 9 NA
4 A 1 4 3
5 A 1 3 1
6 A 1 1 15
7 A 0 4 NA
8 A 0 5 NA
9 A 1 6 NA
10 A 0 3 NA
11 B 0 2 NA
12 B 1 6 1
13 B 1 1 17
14 B 0 5 NA
15 B 0 3 NA
16 B 0 3 NA
17 B 1 6 1
18 B 1 1 8
19 B 1 8 NA
20 B 0 2 NA
The State column indicates whether a fishing boat is fishing (State = 1) or not fishing (State = 0). I want to calculate the distance travelled between consecutive fishing events (State = 1).
The Distance column gives the distance between that row's location and the previous row's location (i.e. it is the lag distance).
The DistanceToNext column is the answer I am trying to generate. It should be NA for all rows in the Trip until the first row where State = 1. For that row, DistanceToNext should equal the sum of the Distance column over the subsequent rows, up to and including the next row where State = 1.
For example, row 4 is the first fishing event (State = 1) in Trip A, so its DistanceToNext should be the distance travelled before the next fishing event, which in this case is the very next row (row 5), with a Distance of 3.
For row 5 the next fishing event is again the very next row (row 6), which has a Distance of 1. For row 6, however, there isn't another fishing event until row 9, so I want the sum of the Distance column over rows 7 to 9, which is 15.
If a row is the last State = 1 row in its Trip group (A or B), there is no later fishing event and so no distance to calculate; I want it to give NA.
Here is another solution you could use. I wrote a custom function that takes the State and Distance vectors of each group and returns the desired output:
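library(dplyr)

fn <- function(State, Distance) {
  out <- rep(NA_real_, length(State))
  inds <- which(State == 1)
  for (i in inds) {
    # the last fishing row in the group has no next event, so it stays NA
    # (checking this first also avoids reading State[i + 1] past the end)
    if (i == inds[length(inds)]) next
    if (State[i + 1] == 1) {
      # the next row is also fishing: use that row's lag distance
      out[i] <- Distance[i + 1]
    } else {
      # sum the lag distances up to and including the next fishing row
      nx <- which(inds == i)
      out[i] <- sum(Distance[(i + 1):inds[nx + 1]])
    }
  }
  out
}

df %>%
  group_by(Trip) %>%
  mutate(MyDistance = fn(State, Distance))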
# A tibble: 20 x 5
# Groups: Trip [2]
Trip State Distance DistanceToNext MyDistance
<chr> <dbl> <dbl> <dbl> <dbl>
1 A 0 0 NA NA
2 A 0 2 NA NA
3 A 0 9 NA NA
4 A 1 4 3 3
5 A 1 3 1 1
6 A 1 1 15 15
7 A 0 4 NA NA
8 A 0 5 NA NA
9 A 1 6 NA NA
10 A 0 3 NA NA
11 B 0 2 NA NA
12 B 1 6 1 1
13 B 1 1 17 17
14 B 0 5 NA NA
15 B 0 3 NA NA
16 B 0 3 NA NA
17 B 1 6 1 1
18 B 1 1 8 8
19 B 1 8 NA NA
20 B 0 2 NA NA
In base R you would do:
fun <- function(df){
a <- which(df$State == 1)
b <- rep(NA, nrow(df))
d <- mapply(function(x, y) sum(df$Distance[(x+1):y]), head(a,-1), tail(a, -1))
b[a] <- c(d, NA)
transform(df, DisttoNext = b)
}
do.call(rbind, by(df, df$Trip, fun))
Trip State Distance DistanceToNext DisttoNext
A.1 A 0 0 NA NA
A.2 A 0 2 NA NA
A.3 A 0 9 NA NA
A.4 A 1 4 3 3
A.5 A 1 3 1 1
A.6 A 1 1 15 15
A.7 A 0 4 NA NA
A.8 A 0 5 NA NA
A.9 A 1 6 NA NA
A.10 A 0 3 NA NA
B.11 B 0 2 NA NA
B.12 B 1 6 1 1
B.13 B 1 1 17 17
B.14 B 0 5 NA NA
B.15 B 0 3 NA NA
B.16 B 0 3 NA NA
B.17 B 1 6 1 1
B.18 B 1 1 8 8
B.19 B 1 8 NA NA
B.20 B 0 2 NA NA
A data.table alternative.
library(data.table)
setDT(df)
df[,`:=`(next_dist = shift(Distance, type = "lead"), g = cumsum(State), ri = .I),
by = Trip]
d = df[ , .(ri = ri[1], State = State[1], s = sum(next_dist)), by = .(Trip, g)]
df[d[State == 1, .SD[-.N], by = Trip], on = .(ri), s := s]
df[ , `:=`(ri = NULL, next_dist = NULL, g = NULL)]
# Trip State Distance DistanceToNext s
# 1: A 0 0 NA NA
# 2: A 0 2 NA NA
# 3: A 0 9 NA NA
# 4: A 1 4 3 3
# 5: A 1 3 1 1
# 6: A 1 1 15 15
# 7: A 0 4 NA NA
# 8: A 0 5 NA NA
# 9: A 1 6 NA NA
# 10: A 0 3 NA NA
# 11: B 0 2 NA NA
# 12: B 1 6 1 1
# 13: B 1 1 17 17
# 14: B 0 5 NA NA
# 15: B 0 3 NA NA
# 16: B 0 3 NA NA
# 17: B 1 6 1 1
# 18: B 1 1 8 8
# 19: B 1 8 NA NA
# 20: B 0 2 NA NA
Explanation:
Convert data to data.table (setDT(df)).
For each 'Trip' (by = Trip), create new variables by reference (:=): next distance (shift(Distance, type = "lead")), a grouping variable which increases every time 'State' is 1 (cumsum(State)), and a row index used to join the result (.I; this could also be done first, without the grouping).
For each 'Trip' and 'State group' (by = .(Trip, g)), select first row index (ri[1]), first 'State' (State = State[1]), and sum the lead distances (sum(next_dist)).
From the result above, select rows where 'State' is 1 (State == 1). Then, for each 'Trip' (by = Trip), select the subset of data (.SD) except the last row (-.N). Join to the original data on row index (on = .(ri)). Create a new column, sum of distances, 's' by reference (:=). If desired, remove temp variables.
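The grouping idea in the explanation above can also be traced by hand on a single trip. The following is a base R sketch of the same steps (lead distance, cumsum(State) as group id, sum per group, drop the last fishing event), not the data.table code itself. The variable names are chosen for illustration and the data are Trip A from the question.

```r
# Trip A from the question
State    <- c(0, 0, 0, 1, 1, 1, 0, 0, 1, 0)
Distance <- c(0, 2, 9, 4, 3, 1, 4, 5, 6, 3)

next_dist <- c(Distance[-1], NA)  # lead: the distance recorded on the following row
g <- cumsum(State)                # group id that steps up at every fishing row

# Sum of lead distances within each group, named by group id
s <- tapply(next_dist, g, sum)

out <- rep(NA_real_, length(State))
fishing_rows <- which(State == 1)   # first row of groups 1, 2, ...
keep <- head(fishing_rows, -1)      # the last fishing event has no "next" event
out[keep] <- as.numeric(s[as.character(g[keep])])
out
```

Because each group starts at a fishing row and runs until just before the next one, summing the lead distances inside a group covers exactly the rows up to and including the next fishing row.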
df <- data.frame(ID=c(1,2,3,4,5,6),
CO=c(-6,4,2,3,0,2),
CATFOX=c(1,NA,NA,3,0,NA),
DOGFOX=c(NA,NA,5,1,2,NA),
RABFOX=c(NA,3,NA,5,3,NA),
D=c(0,4,5,6,1,2),
WANT=c(1,3,5,3,0,NA))
I have a data frame and I wish to make column WANT take the first non-NA value of 'CATFOX', 'DOGFOX', 'RABFOX', in that order. Is there a data.table solution? I tried this, but it did not produce the desired outcome:
df$WANT=do.call(coalesce, data[grepl('FOX',names(data))])
Your example uses coalesce, which is a dplyr function. In data.table, try fcoalesce:
library(data.table)
setDT(df)[, WANT2 := fcoalesce(CATFOX, DOGFOX, RABFOX)]
Output:
ID CO CATFOX DOGFOX RABFOX D WANT WANT2
1: 1 -6 1 NA NA 0 1 1
2: 2 4 NA NA 3 4 3 3
3: 3 2 NA 5 NA 5 5 5
4: 4 3 3 1 5 6 3 3
5: 5 0 0 2 3 1 0 0
6: 6 2 NA NA NA 2 NA NA
We can use a vectorized option in base R
i1 <- endsWith(names(df), 'FOX')
df$WANT2 <- df[i1][cbind(seq_len(nrow(df)), max.col(!is.na(df[i1]), 'first'))]
df$WANT2
#[1] 1 3 5 3 0 NA
You could try this base R solution:
#Data
data=data.frame(ID=c(1,2,3,4,5),
CO=c(-6,4,2,3,0),
CATFOX=c(1,NA,NA,3,0),
DOGFOX=c(NA,NA,5,1,2),
RABFOX=c(NA,3,NA,5,3),
D=c(0,4,5,6,1),
WANT=c(1,3,5,3,0))
#Process
index <- which(names(data) %in% c('CATFOX','DOGFOX','RABFOX'))
data$WANT2 <- apply(data[,index],1,function(x) x[min(which(!is.na(x)))])
Output:
ID CO CATFOX DOGFOX RABFOX D WANT WANT2
1 1 -6 1 NA NA 0 1 1
2 2 4 NA NA 3 4 3 3
3 3 2 NA 5 NA 5 5 5
4 4 3 3 1 5 6 3 3
5 5 0 0 2 3 1 0 0
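For comparison, here is a minimal base R sketch that selects the FOX columns by pattern (as the question's grepl attempt did) and folds over them with Reduce. It is offered as one possible alternative, not as the method used in the answers above. It keeps row 6, where all three columns are NA, and leaves that row as NA.

```r
df <- data.frame(ID = 1:6,
                 CATFOX = c(1, NA, NA, 3, 0, NA),
                 DOGFOX = c(NA, NA, 5, 1, 2, NA),
                 RABFOX = c(NA, 3, NA, 5, 3, NA))

# Pick the columns dynamically by name
fox_cols <- grep("FOX$", names(df), value = TRUE)

# Fold left to right: keep the value so far, fill its NAs from the next column
df$WANT2 <- Reduce(function(acc, nxt) ifelse(is.na(acc), nxt, acc), df[fox_cols])
df$WANT2
```

The column order in fox_cols decides the priority, so reorder that vector if a different precedence is wanted.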
I have a dataframe in R:
Subject T O E P Score
1 0 1 0 1 256
2 1 0 1 0 325
2 0 1 0 1 125
3 0 1 0 1 27
4 0 0 0 1 87
5 0 1 0 1 125
6 0 1 1 1 100
This is just a display of the dataframe. In reality, I have a lot of lines for each of the subjects. But the subjects are only from 1 to 6
For each Subject, the possible values are:
T : 0 or 1
O : 0 or 1
E : 0 or 1
P : 0 or 1
Score : Numeric value
I want to create a new dataframe with 6 lines (one for each subject) and the calculated MEAN score for each of these combinations :
T , O , E , P , TO , TE, TP, OE , OP , PE , TOP , TOE , POE , PET
The above will be the columns of the new dataframe.
The final output should look like this
Subject T O E P TO TE TP OE OP PE TOP TOE POE PET
1
2
3
4
5
6
For each of these lines x columns the value is the MEAN SCORE
I tried aggregate and table but I can't seem to get what I want
Sorry I am new to R
Thanks
I had to rebuild sample data to answer the question as I understood it; tell me if it works for you:
set.seed(2)
df <- data.frame(subject=sample(1:3,9,T),
T = sample(c(0,1),9,T),
O = sample(c(0,1),9,T),
E = sample(c(0,1),9,T),
P = sample(c(0,1),9,T),
score=round(rnorm(9,10,3)))
# subject T O E P score
# 1 1 1 0 0 1 12
# 2 3 1 0 1 0 9
# 3 2 0 1 0 1 13
# 4 1 1 0 0 0 3
# 5 3 0 1 0 1 14
# 6 3 0 0 1 0 13
# 7 1 1 0 1 0 17
# 8 3 1 0 1 0 12
# 9 2 0 0 1 1 14
cols1 <- c("T","O","E","P")
df$comb <- apply(df[cols1],1,function(x) paste(names(df[cols1])[as.logical(x)],collapse=""))
# subject T O E P score comb
# 1 1 1 0 0 1 12 TP
# 2 3 1 0 1 0 9 TE
# 3 2 0 1 0 1 13 OP
# 4 1 1 0 0 0 3 T
# 5 3 0 1 0 1 14 OP
# 6 3 0 0 1 0 13 E
# 7 1 1 0 1 0 17 TE
# 8 3 1 0 1 0 12 TE
# 9 2 0 0 1 1 14 EP
library(tidyverse)
df %>%
group_by(subject,comb) %>%
summarize(score=mean(score)) %>%
spread(comb,score) %>%
ungroup
# # A tibble: 3 x 7
# subject E EP OP T TE TP
# * <int> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl>
# 1 1 NA NA NA 3 17.0 12
# 2 2 NA 14 13 NA NA NA
# 3 3 13 NA 14 NA 10.5 NA
The second step in base R:
means <- aggregate(score ~ subject + comb,df,mean)
means2 <- reshape(means,timevar="comb",idvar="subject",direction="wide")
setNames(means2,c("subject",sort(unique(df$comb))))
# subject E EP OP T TE TP
# 1 3 13 NA 14 NA 10.5 NA
# 2 2 NA 14 13 NA NA NA
# 5 1 NA NA NA 3 17.0 12
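As a compact variation on the same two steps (label each row with its combination, then average per subject and combination), tapply can build the subject-by-combination table directly. This is a sketch on the asker's seven-row table rather than the rebuilt sample above. The names flags and means are chosen for illustration.

```r
df <- read.table(header = TRUE, text = "
Subject T O E P Score
1 0 1 0 1 256
2 1 0 1 0 325
2 0 1 0 1 125
3 0 1 0 1 27
4 0 0 0 1 87
5 0 1 0 1 125
6 0 1 1 1 100")

flags <- c("T", "O", "E", "P")

# Label each row with the flags that are switched on, always in T, O, E, P order
df$comb <- apply(df[flags] == 1, 1, function(on) paste(flags[on], collapse = ""))

# Mean score per subject (rows) and combination (columns); empty cells are NA
means <- tapply(df$Score, list(Subject = df$Subject, comb = df$comb), mean)
means
```

The result is a matrix rather than a data frame, and it only has columns for combinations that occur in the data.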
I'd do it like this:
# using your table data
df = read.table(text =
"Subject T O E P Score
1 0 1 0 1 256
2 1 0 1 0 325
2 0 1 0 1 125
3 0 1 0 1 27
4 0 0 0 1 87
5 0 1 0 1 125
6 0 1 1 1 100", stringsAsFactors = FALSE, header=TRUE)
# your desired column names
new_names <- c("T", "O", "E", "P", "TO", "TE", "TP", "OE",
"OP", "PE", "TOP", "TOE", "POE", "PET")
# assigning each of your scores to one of the desired column names
assign_comb <- function(dfrow) {
selection <- c("T", "O", "E", "P")[as.logical(dfrow[2:5])]
do.call(paste, as.list(c(selection, sep = "")))
}
df$comb <- apply(df, 1, assign_comb)
# aggregate all the means together
df_agg <- aggregate(df$Score ~ df$comb + df$Subject, FUN = mean)
# reshape the data to wide format
df_new <- reshape(df_agg, v.names = "df$Score", idvar = "df$Subject",
timevar = "df$comb", direction = "wide")
# clean up the column names to match your desired output
colnames(df_new) <- gsub("df\\$|Score\\.", "", colnames(df_new))
# comb is always built in T, O, E, P order, so translate those spellings
# to the requested ones (EP -> PE, OEP -> POE, TEP -> PET)
rename_map <- c(EP = "PE", OEP = "POE", TEP = "PET")
hit <- colnames(df_new) %in% names(rename_map)
colnames(df_new)[hit] <- unname(rename_map[colnames(df_new)[hit]])
# any column names not found will be added as NA
df_new[, new_names[!new_names %in% colnames(df_new)]] <- NA
df_new <- df_new[, c("Subject", new_names)]
With the result:
> df_new
Subject T O E P TO TE TP OE OP PE TOP TOE POE PET
1 1 NA NA NA NA NA NA NA NA 256 NA NA NA NA NA
2 2 NA NA NA NA NA 325 NA NA 125 NA NA NA NA NA
4 3 NA NA NA NA NA NA NA NA 27 NA NA NA NA NA
5 4 NA NA NA 87 NA NA NA NA NA NA NA NA NA NA
6 5 NA NA NA NA NA NA NA NA 125 NA NA NA NA NA
7 6 NA NA NA NA NA NA NA NA NA NA NA NA 100 NA
My dataframe, D is like this.
Each value of D$fit belongs to one distance (0:6) and one dg (1:3).
D <- read.table(header = TRUE, text = "
distance dg fit
1 0 1 A
2 1 1 B
3 2 1 C
4 3 1 D
5 4 1 E
6 5 1 F
7 6 1 G
8 0 2 H
9 1 2 I
10 2 2 J
11 3 2 K
12 4 2 L
13 5 2 M
14 0 3 O
15 1 3 P
16 2 3 Q
17 3 3 R
")
I want to assign fit values to this matrix, md, corresponding to distance and dg.
md <- matrix(1:21, nrow = 7)
colnames(md) <- c(1:3)
rownames(md) <- c(0:6)
md[] <- NA
1 2 3
0 NA NA NA
1 NA NA NA
2 NA NA NA
3 NA NA NA
4 NA NA NA
5 NA NA NA
6 NA NA NA
I've tried but failed with this code
cmd = expand.grid(i=seq(0,6), j = seq(1,3))
i <- seq(0,6)
j <- seq(1,3)
md[i,j] <- D$fit[D$distance == cmd[1] & D$dg == cmd[2]]
We can use acast from library(reshape2)
library(reshape2)
acast(D, distance~dg, value.var="fit")
Or with reshape from base R
reshape(D, idvar="distance", timevar="dg", direction="wide")
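If the goal is to fill the existing matrix rather than build a new object, base R matrix indexing with a two-column character matrix is another option. The sketch below rebuilds D and md so that it runs on its own. It assumes md carries the distance values as row names and the dg values as column names, as in the question.

```r
# Same data as in the question (the letter N is skipped there as well)
D <- data.frame(distance = c(0:6, 0:5, 0:3),
                dg = rep(1:3, times = c(7, 6, 4)),
                fit = c(LETTERS[1:13], LETTERS[15:18]),
                stringsAsFactors = FALSE)

md <- matrix(NA_character_, nrow = 7, ncol = 3,
             dimnames = list(as.character(0:6), as.character(1:3)))

# Each row of the index matrix is one (row name, column name) pair
idx <- cbind(as.character(D$distance), as.character(D$dg))
md[idx] <- D$fit
md
```

Cells with no matching (distance, dg) pair in D stay NA.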
I am new to R so am still getting my head around the way it works. My problem is as follows, I have a data frame and a prioritised list of columns (pl), I need:
To find the maximum value from the columns in pl for each row and create a new column with this value (df$max)
Using the priority list, subtract this maximum value from the priority value, ignoring NAs and returning the absolute difference
Probably better with an example:
My priority list is
pl <- c("E","D","A","B")
and the data frame is:
A B C D E F G
1 15 5 20 9 NA 6 1
2 3 2 NA 5 1 3 2
3 NA NA 3 NA NA NA NA
4 0 1 0 7 8 NA 6
5 1 2 3 NA NA 1 6
So for the first row the maximum is from column A (15) and the priority value is from column D (9), since E is NA. The answer I want should look like this:
A B C D E F G MAX MAX-PR
1 15 5 20 9 NA 6 1 15 6
2 3 2 NA 5 1 3 2 5 4
3 NA NA 3 NA NA NA NA NA NA
4 0 1 0 7 8 NA 6 8 0
5 1 2 3 NA NA 1 6 2 1
How about this?
df$MAX <- apply(df[,pl], 1, max, na.rm = T)
df$MAX_PR <- df$MAX - apply(df[,pl], 1, function(x) x[!is.na(x)][1])
df$MAX[is.infinite(df$MAX)] <- NA
> df
# A B C D E F G MAX MAX_PR
# 1 15 5 20 9 NA 6 1 15 6
# 2 3 2 NA 5 1 3 2 5 4
# 3 NA NA 3 NA NA NA NA NA NA
# 4 0 1 0 7 8 NA 6 8 0
# 5 1 2 3 NA NA 1 6 2 1
Example:
df <- data.frame(A=c(1,NA,2,5,3,1),B=c(3,5,NA,6,NA,10),C=c(NA,3,4,5,1,4))
pl <- c("B","A","C")
#now we find the maximum per row, ignoring NAs
max.per.row <- apply(df,1,max,na.rm=T)
#and the first element according to the priority list, ignoring NAs
#(there may be a more efficient way to do this)
first.per.row <- apply(df[,pl],1, function(x) as.vector(na.omit(x))[1])
#and finally compute the difference
max.less.first.per.row <- max.per.row - first.per.row
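Note that this code does not handle rows that are all NA: for such a row, max with na.rm=T returns -Inf with a warning, and the first non-NA element comes back as NA. There is no check against that.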
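One way to guard against all-NA rows is to return NA for both values before calling max. The following is a sketch on the question's data, restricted to the priority columns as the question asks. The result name res and the column labels MAX and MAX_PR are chosen for illustration.

```r
df <- data.frame(A = c(15, 3, NA, 0, 1),
                 B = c(5, 2, NA, 1, 2),
                 C = c(20, NA, 3, 0, 3),
                 D = c(9, 5, NA, 7, NA),
                 E = c(NA, 1, NA, 8, NA),
                 F = c(6, 3, NA, NA, 1),
                 G = c(1, 2, NA, 6, 6))
pl <- c("E", "D", "A", "B")

res <- t(apply(df[, pl], 1, function(x) {
  x <- unname(x[!is.na(x)])  # non-NA values, still in priority order
  # nothing left in this row: report NA instead of -Inf
  if (length(x) == 0) return(c(MAX = NA_real_, MAX_PR = NA_real_))
  c(MAX = max(x), MAX_PR = max(x) - x[1])
}))
cbind(df, res)
```

Since the maximum is never smaller than the priority value, MAX - priority is already the absolute difference.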
Here is a simple version. First, I take only the pl columns; then, for each row, I remove the NAs and compute the max and the max minus the first remaining value.
df <- dat[,pl]
cbind(dat, t(apply(df, 1, function(x) {
x <- na.omit(x)
c(max(x),max(x)-x[1])
}
)
)
)
A B C D E F G 1 2
1 15 5 20 9 NA 6 1 15 6
2 3 2 NA 5 1 3 2 5 4
3 NA NA 3 NA NA NA NA -Inf NA
4 0 1 0 7 8 NA 6 8 0
5 1 2 3 NA NA 1 6 2 1