I am trying to loop over specific pairs of columns (they have similar names) and create new columns based on a conditional statement.
Example dataset:
set.seed(2)
df <- data.frame(id=rep(1:5),
                 s1=rnorm(5, 0, 3),
                 s2=rnorm(5, 0, 3),
                 s2a=rnorm(5, 0, 3),
                 st1=rnorm(5, 3, 3),
                 st2=rnorm(5, 3, 3),
                 st2a=rnorm(5, 3, 3))
> df
id s1 s2 s2a st1 st2 st2a
1 1 -2.6907436 0.3972609 1.252952 -3.933207 9.2724576 -4.355119
2 2 0.5545476 2.1238642 2.945258 5.635814 -0.5997775 4.431712
3 3 4.7635360 -0.7190941 -1.178086 3.107420 7.7689146 1.210325
4 4 -3.3911270 5.9534218 -3.119007 6.038486 8.8639549 5.376610
5 5 -0.2407553 -0.4163610 5.346687 4.296795 3.0148133 3.868910
Column s1 is paired with column st1, s2 with st2, and s2a with st2a. For each pair I want a 1/0 indicator that is 1 when the first column is at most -3 and the second column is at least 0, e.g. df$ys1<-ifelse(df$s1<=-3 & df$st1>=0, 1, 0). The ultimate aim is to create a final variable yes_no (1/0) that indicates whether any of the pair indicators is 1, e.g. df$yes_no<-ifelse(df$ys1==1 | df$ys2==1 | df$ys2a==1, 1, 0)
The new dataset should look like this:
> df
id s1 s2 s2a st1 st2 st2a ys1 ys2 ys2a yes_no
1 1 -2.6907436 0.3972609 1.252952 -3.933207 9.2724576 -4.355119 0 0 0 0
2 2 0.5545476 2.1238642 2.945258 5.635814 -0.5997775 4.431712 0 0 0 0
3 3 4.7635360 -0.7190941 -1.178086 3.107420 7.7689146 1.210325 0 0 0 0
4 4 -3.3911270 5.9534218 -3.119007 6.038486 8.8639549 5.376610 1 0 1 1
5 5 -0.2407553 -0.4163610 5.346687 4.296795 3.0148133 3.868910 0 0 0 0
I'm sure there is a way of doing a loop without actually creating all additional columns (i.e. just create the final column, yes_no ) but I would be interested in how to create these just to know how to do it, in addition to a neater method.
I think a way of doing it would be to break up the dataset into two sets based on the pairs and then use in a loop:
firstt<-(df[,c(2:4)])
final<-(df[,c(5:7)])
or skip that and try directly in a loop
for(i in names(df[,c(2:4)])){
r<-(df[,c(5:7)])
df[i] <-ifelse(df$[i]<=-3 & df$[r]>=0, 1, 0)
}
Obviously that won't work, but it shows the idea of what I was trying.
Any help would be appreciated.
Here is a solution in base R:
X <- grep("st", names(df), value=TRUE)   # second column of each pair
Y <- gsub("t", "", X)                    # matching first column of each pair
df$yes_no <-
  rowSums(mapply(function(i, r)
    ifelse(df[[r]] <= -3 & df[[i]] >= 0, 1, 0),
    X, Y)) > 0
1- I am using a regex to extract the column names (you could also use indices here):
X = "st1" "st2" "st2a"
Y = "s1" "s2" "s2a"
2- I am using mapply to apply the function to the paired elements (the first element of X with the first element of Y, and so on).
3- rowSums aggregates the 3 columns into one, and > 0 converts the result to a logical vector.
Here is another solution with a for loop:
a <- names(df)[2:4]
b <- names(df)[5:7]
for(i in seq_along(a)){
  # new columns are numbered by position, so the pair s2a/st2a ends up in ys3
  df[[paste0("ys", i)]] <- ifelse(df[[a[i]]] <= -3 & df[[b[i]]] >= 0, 1, 0)
}
# 1 if any of the ys columns is 1, otherwise 0
df$yes_no <- as.numeric(rowSums(df[grep("^ys", names(df))] == 1) > 0)
print(df)
id s1 s2 s2a st1 st2 st2a ys1 ys2 ys3 yes_no
1 1 -2.6907436 0.3972609 1.252952 -3.933207 9.2724576 -4.355119 0 0 0 0
2 2 0.5545476 2.1238642 2.945258 5.635814 -0.5997775 4.431712 0 0 0 0
3 3 4.7635360 -0.7190941 -1.178086 3.107420 7.7689146 1.210325 0 0 0 0
4 4 -3.3911270 5.9534218 -3.119007 6.038486 8.8639549 5.376610 1 0 1 1
5 5 -0.2407553 -0.4163610 5.346687 4.296795 3.0148133 3.868910 0 0 0 0
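If you want the intermediate columns to keep the pair names (ys1, ys2, ys2a) as in the question, the following is a minimal sketch of one way to do it. It assumes the second column of each pair is always the first column's name with "s" replaced by "st"; the data values are rounded copies of the example data, and the helper names first, second and flags are made up for illustration.

```r
# Rounded copy of the example data from the question
df <- data.frame(
  id   = 1:5,
  s1   = c(-2.69,  0.55,  4.76, -3.39, -0.24),
  s2   = c( 0.40,  2.12, -0.72,  5.95, -0.42),
  s2a  = c( 1.25,  2.95, -1.18, -3.12,  5.35),
  st1  = c(-3.93,  5.64,  3.11,  6.04,  4.30),
  st2  = c( 9.27, -0.60,  7.77,  8.86,  3.01),
  st2a = c(-4.36,  4.43,  1.21,  5.38,  3.87)
)

first  <- c("s1", "s2", "s2a")       # first column of each pair
second <- sub("^s", "st", first)     # assumed naming rule: s1 -> st1, etc.

# One 0/1 column per pair; mapply walks over both name vectors in parallel
flags <- mapply(function(f, s) as.integer(df[[f]] <= -3 & df[[s]] >= 0),
                first, second)
colnames(flags) <- paste0("y", first)   # ys1, ys2, ys2a

df <- cbind(df, flags)
df$yes_no <- as.integer(rowSums(flags) > 0)   # 1 if any pair is flagged
```

If only yes_no is needed, the cbind line can be dropped and flags never has to be attached to df.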
Related
So I have this vector:
a = sample(0:3, size=30, replace = T)
[1] 0 1 3 3 0 1 1 1 3 3 2 1 1 3 0 2 1 1 2 0 1 1 3 2 2 3 0 1 3 2
What I want to have is a list of vectors with all the elements that are separated by n 0s. So in this case, with n = 0 (there can't be any 0 between the consecutive values), this would give:
res = c([1,3,3], [1,1,1,3,3,2,1,1,3], [2,1,1,2]....)
However, I would like to control the n parameter flexibly, so that if I set it, for example, to 2, something like this:
b = c(1,2,0,3,0,0,4)
would still result in a result like this
res = c([1,2,3],[4])
I tried a lot of approaches with while loops in for-loops while trying to count the number of 0s. But I just could not achieve it.
Update
I tried to post the question in a more real-world setting here:
Flexibly calculate column based on consecutive counts in another column in R
Thank you all for the help. I just don't seem to manage to put your help into practice with my limited knowledge.
Here is a base R option using rle + split for the general case, i.e., the values in b are not limited to 0 to 3.
with(
rle(with(rle(b == 0), rep(values & lengths == n, lengths))),
Map(
function(x) x[x != 0],
unname(split(b, cut(seq_along(b), c(0, cumsum(lengths))))[!values])
)
)
which gives (assuming n=2)
[[1]]
[1] 1 2 3
[[2]]
[1] 4
If your values are within the range 0 to 9, you can try the code below
lapply(
unlist(strsplit(paste0(b, collapse = ""), strrep(0, n))),
function(x) {
as.numeric(
unlist(strsplit(gsub("0", "", x), ""))
)
}
)
which also gives
[[1]]
[1] 1 2 3
[[2]]
[1] 4
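The rle idea above can also be wrapped in a small function. This is only a sketch under one particular reading of the question: the parameter min_zeros (a made-up name) is the shortest run of zeros that counts as a separator, so min_zeros = 1 corresponds to the "n = 0" example and min_zeros = 2 to the b example. Shorter runs of zeros are simply dropped.

```r
# Split x wherever a run of at least `min_zeros` zeros occurs;
# all zeros are removed from the resulting pieces.
split_on_zero_runs <- function(x, min_zeros = 1) {
  r <- rle(x == 0)
  # TRUE for every position that belongs to a separating run of zeros
  is_break <- rep(r$values & r$lengths >= min_zeros, r$lengths)
  # group id goes up by one at the start of each separating run
  group <- cumsum(c(is_break[1], diff(is_break) == 1))
  keep <- x != 0
  unname(split(x[keep], group[keep]))
}

b <- c(1, 2, 0, 3, 0, 0, 4)
res_b <- split_on_zero_runs(b, min_zeros = 2)   # list(c(1, 2, 3), 4)

a <- c(0, 1, 3, 3, 0, 1, 1, 1, 3, 3, 2, 1, 1, 3, 0, 2, 1, 1, 2, 0)
res_a <- split_on_zero_runs(a, min_zeros = 1)
```

Unlike the rle + split answer, which separates only at runs of exactly n zeros, this sketch separates at runs of n or more; which of the two is wanted depends on how the question is read.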
I also wanted to post a somewhat useful solution using the function SplitAt from DescTools:
library(DescTools)
library(magrittr) # for %>%
SplitAt(a, which(a==0)) %>% lapply(., function(x) x[which(x != 0)])
where a is your initial vector. It gives you a list where every entry contains the numbers between zeros.
If you then apply another SplitAt to that list, you can create sublist after sublist and split it into as many sublists as you want, e.g.:
n <- 4
SplitAt(a, which(a==0)) %>% lapply(., function(x) x[which(x != 0)]) %>% SplitAt(., n)
gives you:
set.seed(1)
a <- sample(0:3, size=30, replace = T)
a
[1] 0 3 2 0 1 0 2 2 1 1 2 2 0 0 0 1 1 1 1 2 0 2 0 0 0 0 1 0 0 1
a2 <- paste(a, collapse = "") # Collapse into a single string, making it easier to handle patterns.
a3 <- unlist(strsplit(a2, "0")) # Split at each "0"; change to whatever pattern you want, like "00".
a3 <- a3[a3 != ""] # Remove empty elements
a3 <- as.numeric(a3) # Turn back to numeric
a3
[1] 32 1 221122 11112 2 1 1
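Note that the result above glues the digits of each group together into one number (32 rather than 3 and 2). If a list of vectors is wanted, as in the question, one possible variation is to split each piece into single characters before converting. This sketch only works when every value is a single digit (0 to 9); the names pieces and res are made up for illustration.

```r
a <- c(0, 3, 2, 0, 1, 0, 2, 2, 1, 1, 2, 2, 0, 0, 0,
       1, 1, 1, 1, 2, 0, 2, 0, 0, 0, 0, 1, 0, 0, 1)

pieces <- unlist(strsplit(paste(a, collapse = ""), "0"))  # split at every "0"
pieces <- pieces[pieces != ""]                            # drop empty pieces

# Split each piece into its digits and convert back to numeric
res <- lapply(strsplit(pieces, ""), as.numeric)
res[[1]]   # 3 2
```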
I have 2 datasets. For each row in dataset1, I want to calculate the difference from every row in dataset2. I also want to replace any negative difference by 0. Here is a simple example of my 2 datasets (my real datasets are around 1000*1000).
df1 <- data.frame(ID = c(1, 2), Obs = c(1.0, 2.0), var=c(2.0,5.0))
df2 <- data.frame(ID = c(2, 1), Obs = c(3.0, 2.0),var=c(7.0,3.0))
df1
ID Obs var
1 1 1 2
2 2 2 5
df2
ID Obs var
1 2 3 7
2 1 2 3
for(i in 1:nrow(df1)){
b1=as.matrix(df1)
b2=as.matrix(df2)
diff= b1-b2
diff[which(diff < 0 )] <- 0
diff.data= data.frame(cbind(diff, total = rowSums(diff)))
}
diff.data
ID Obs var total
1 0 0 0 0
2 1 0 2 3
This is what I have been able to do: I took the difference between the 2 datasets, replaced the negative values by 0, and then summed the values across the columns. What I actually want is, for the first row in df1, the difference from each of the rows in df2, then for the second row in df1 the difference from each of the rows in df2 (and so on). Note that the IDs should not be subtracted (I don't know how to do that, maybe by changing diff= b1-b2 to diff= b1[,-1]-b2[,-1]?). I want to keep the ID from df1 to keep track of my patients (which is what my dataset contains). I would like to have something like this:
diff.data
ID Obs var total
1 0 0 0
1 0 0 0
2 0 0 0
2 0 2 2
I thank you in advance for your help.
Here is what I have using your answer; I wanted to wrap it in a simple function. I would like my datasets to be allowed to be either matrices or data frames, but so far I have only managed to raise an error when the inputs are not data frames:
difference=function(df1,df2){
if(class(df1) != "data.frame" || class(df2) != "data.frame") stop(" df1 or df2 is not a dataframe!")
df1=data.frame(df1)
df2=data.frame(df2)
ID1=seq(nrow(df1))
ID2=seq(nrow(df2))
new_df1 = df1[rep(ID1, each = nrow(df2)), ]
new_df1[-1] = new_df1[-1] - df2[rep(seq(nrow(df2)), nrow(df1)), -1]
new_df1[new_df1 < 0] = 0
new_df1$total = rowSums(new_df1[-1])
rownames(new_df1) = NULL
output=new_df1
return(output)
}
I know that df1=data.frame(df1) assumes the input is a data frame; I just don't know how to also allow it to be a matrix.
Thank you again in advance for your help.
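One possible way to accept both kinds of input is to test with is.data.frame() and is.matrix() and then convert everything with as.data.frame(). The sketch below assumes, as in the example, that the first column is the ID and all other columns are numeric; it is an illustration rather than a definitive implementation.

```r
difference <- function(df1, df2) {
  # Accept data frames or matrices; reject anything else
  if (!(is.data.frame(df1) || is.matrix(df1)) ||
      !(is.data.frame(df2) || is.matrix(df2))) {
    stop("df1 and df2 must each be a data frame or a matrix")
  }
  df1 <- as.data.frame(df1)
  df2 <- as.data.frame(df2)

  i1 <- rep(seq_len(nrow(df1)), each = nrow(df2))    # each df1 row repeated
  i2 <- rep(seq_len(nrow(df2)), times = nrow(df1))   # df2 rows cycled

  out <- df1[i1, , drop = FALSE]
  out[-1] <- out[-1] - df2[i2, -1, drop = FALSE]     # skip the ID column
  out[-1] <- lapply(out[-1], pmax, 0)                # negative differences -> 0
  out$total <- unname(rowSums(out[-1]))
  rownames(out) <- NULL
  out
}

df1 <- data.frame(ID = c(1, 2), Obs = c(1.0, 2.0), var = c(2.0, 5.0))
df2 <- data.frame(ID = c(2, 1), Obs = c(3.0, 2.0), var = c(7.0, 3.0))
res_df  <- difference(df1, df2)
res_mat <- difference(as.matrix(df1), as.matrix(df2))
```

Using is.data.frame() and is.matrix() also avoids comparing class() against a single string, which is fragile because class() can return more than one value.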
You can repeat each row of df1 nrow(df2) times and repeat the rows of df2 nrow(df1) times, so that both data frames have the same number of rows and the values can be subtracted directly.
#Repeat each row of df1 nrow(df2) times (index by row position, so it does not depend on the ID values)
new_df1 <- df1[rep(seq(nrow(df1)), each = nrow(df2)), ]
#Repeat rows of df2 and subtract
new_df1[-1] <- new_df1[-1] - df2[rep(seq(nrow(df2)), nrow(df1)), -1]
#Replace negative values with 0
new_df1[new_df1 < 0] <- 0
#Add row-wise sum
new_df1$total <- rowSums(new_df1[-1])
#Remove rownames
rownames(new_df1) <- NULL
new_df1
# ID Obs var total
#1 1 0 0 0
#2 1 0 0 0
#3 2 0 0 0
#4 2 0 2 2
I'm using R. I have a data frame that consists of a row for each player and then columns representing each month and a number of points they earned (illustrative data with random values below). I would like to add a new column (Points$ConsecutiveShutouts) that contains the longest consecutive streak for a specified point total over say the past 5 months.
Points <- data.frame(
  "Player" = c("Alpha", "Beta", "Charlie", "Delta", "Echo", "Foxtrot", "Gamma"),
  "MayPts" = c(floor(runif(7, 0, 3))),
  "JunPts" = c(floor(runif(7, 0, 3))),
  "JulPts" = c(floor(runif(7, 0, 3))),
  "AugPts" = c(floor(runif(7, 0, 3))),
  "SepPts" = c(floor(runif(7, 0, 3))),
  "OctPts" = c(floor(runif(7, 0, 3))),
  "NovPts" = c(floor(runif(7, 0, 3))),
  "DecPts" = c(floor(runif(7, 0, 3))))
Player MayPts JunPts JulPts AugPts SepPts OctPts NovPts DecPts
Alpha 0 0 1 0 2 2 2 0
Beta 1 0 1 1 1 1 1 2
Charlie 1 2 2 0 2 1 1 0
Delta 0 1 1 2 2 2 0 0
Echo 1 1 0 2 1 2 0 1
Foxtrot 1 0 0 0 0 0 2 1
Gamma 2 0 1 1 0 2 0 1
I have tried using rle(points):
# Establish the start and end months
StartMonth <- which(colnames(Points) == "SepPts")
EndMonth <- which(colnames(Points) == "DecPts")
# Find total of consecutive months with 0 points
Points$ConsecutiveShutOuts <- max(rle(Points[ ,StartMonth:EndMonth] == 0), lengths[!values])
Doing this, I end up with the error "'X' must be a vector of an atomic type"
Any advice on what I am doing wrong and how I can fix it? Or alternative approaches?
Thanks in advance! [Beginner here, so hopefully I followed the correct approach to question asking :)]
I would use long form as well. I would first create a function like this.
myfun <- function(series,value){
tmp <- rle(series)
runs <- tmp$lengths[tmp$values == value]
if (length(runs)==0) return(0)
else return(max(runs))
}
Using tidyr/dplyr, you can proceed as
library(dplyr)
library(tidyr)
Points %>%
gather(months,Pts,MayPts:DecPts) %>%
group_by(Player) %>%
summarise(x=myfun(tail(Pts,5),0))
# Past 5 month, number of consecutive zeros for each player.
Of course, you can join the result to the original wide-form data frame if you'd like to.
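If you would rather stay in the wide form, the same rle idea can be applied row by row with apply. The sketch below is a base R variation, not the approach of the answer above: the values are copied from the illustrative table in the question, the helper name longest_run is made up, and the last five months are assumed to be the columns AugPts to DecPts.

```r
# Longest run of `value` in `series` (0 if it never occurs)
longest_run <- function(series, value) {
  r <- rle(series)
  runs <- r$lengths[r$values == value]
  if (length(runs) == 0) 0L else max(runs)
}

# Values copied from the illustrative table in the question
Points <- data.frame(
  Player = c("Alpha", "Beta", "Charlie", "Delta", "Echo", "Foxtrot", "Gamma"),
  MayPts = c(0, 1, 1, 0, 1, 1, 2),
  JunPts = c(0, 0, 2, 1, 1, 0, 0),
  JulPts = c(1, 1, 2, 1, 0, 0, 1),
  AugPts = c(0, 1, 0, 2, 2, 0, 1),
  SepPts = c(2, 1, 2, 2, 1, 0, 0),
  OctPts = c(2, 1, 1, 2, 2, 0, 2),
  NovPts = c(2, 1, 1, 0, 0, 2, 0),
  DecPts = c(0, 2, 0, 0, 1, 1, 1)
)

month_cols <- c("AugPts", "SepPts", "OctPts", "NovPts", "DecPts")

# Select only the numeric month columns so that each row is an atomic vector
Points$ConsecutiveShutouts <- apply(Points[month_cols], 1, longest_run, value = 0)
```

Each row is passed to rle as a plain numeric vector here, whereas the attempt in the question handed rle a whole matrix of comparisons.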
If you want to sum based upon some condition (e.g., only summing points higher than 1), you can melt and restrict the summation only to rows greater than that value.
library(data.table)
Points <- as.data.table(Points)
Points <- melt(Points, id="Player", variable.name = "Month", value.name = "PTs")
Points <- Points[PTs>1, list(PTs = sum(PTs, na.rm=TRUE)), by="Player"] #change ">1" if you prefer a different value
I have a dataframe containing (surprise) data. I have one column which I wish to populated on a per-row basis, calculated from the values of other columns in the same row.
From googling, it seems like I need 'apply', or one of its close relatives. Unfortunately I haven't managed to make it actually work.
Example code:
#Example function
getCode <- function (ar1, ar2, ar3){
if(ar1==1 && ar2==1 && ar3==1){
return(1)
} else if(ar1==0 && ar2==0 && ar3==0){
return(0)
}
return(2)
}
#Create data frame
a = c(1,1,0)
b = c(1,0,0)
c = c(1,1,0)
df <- data.frame(a,b,c)
#Add column for new data
df[,"x"] <- 0
#Apply function to new column
df[,"x"] <- apply(df[,"x"], 1, getCode(df[,"a"], df[,"b"], df[,"c"]))
I would like df to be taken from:
a b c x
1 1 1 1 0
2 1 0 1 0
3 0 0 0 0
to
a b c x
1 1 1 1 1
2 1 0 1 2
3 0 0 0 0
Unfortunately running this spits out:
Error in match.fun(FUN) : 'getCode(df[, "a"], df[, "b"], df[,
"c"])' is not a function, character or symbol
I'm new to R, so apologies if the answer is blindingly simple. Thanks.
A few things: apply should be called on the data frame itself (i.e. apply(df, 1, someFunc)), and its third argument must be a function, not the result of calling one. It is also more idiomatic to access columns by name using the $ operator, so if I have a data frame named df with a column named a, I access a with df$a.
In this case, I like to do an sapply along the index of the dataframe, and then use that index to get the appropriate elements from the dataframe.
df$x <- sapply(seq_len(nrow(df)), function(i) getCode(df$a[i], df$b[i], df$c[i]))
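A closely related option is mapply, which walks over several vectors in parallel and so avoids the explicit index. This is a small sketch using the three-argument getCode from the question unchanged in behaviour:

```r
# Three-argument function as in the question
getCode <- function(ar1, ar2, ar3) {
  if (ar1 == 1 && ar2 == 1 && ar3 == 1) return(1)
  if (ar1 == 0 && ar2 == 0 && ar3 == 0) return(0)
  2
}

df <- data.frame(a = c(1, 1, 0), b = c(1, 0, 0), c = c(1, 1, 0))

# getCode is called once per row with the matching elements of a, b and c
df$x <- mapply(getCode, df$a, df$b, df$c)
df$x   # 1 2 0
```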
As @devmacrile mentioned above, I would just modify the function so that it takes a vector of 3 elements as input, and then use it within an apply call as you intended.
#Example function
getCode <- function (x){
ifelse(x[1]==1 & x[2]==1 & x[3]==1,
1,
ifelse(x[1]==0 & x[2]==0 & x[3]==0,
0,
2)) }
#Create data frame
a = c(1,1,0)
b = c(1,0,0)
c = c(1,1,0)
df <- data.frame(a,b,c)
df
# a b c
# 1 1 1 1
# 2 1 0 1
# 3 0 0 0
# create your new column of results
df$x = apply(df, 1, getCode)
df
# a b c x
# 1 1 1 1 1
# 2 1 0 1 2
# 3 0 0 0 0
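Because ifelse is vectorised, the same logic can also be written directly on whole columns, with no apply call at all. A minimal sketch on the same example data:

```r
df <- data.frame(a = c(1, 1, 0), b = c(1, 0, 0), c = c(1, 1, 0))

# Each condition is evaluated for all rows at once
df$x <- ifelse(df$a == 1 & df$b == 1 & df$c == 1, 1,
        ifelse(df$a == 0 & df$b == 0 & df$c == 0, 0, 2))
df$x   # 1 2 0
```

On large data frames this form is usually faster than going row by row, since the comparisons run once over each column.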
I have a matrix mat whose elements are NA, 0, 1, 2. I got an answer about removing the columns with 0 or NA or both values but now I want to add additional condition for deleting the columns.
I have to delete the columns which contain the same value, delete the columns with NA or 0 or both, the columns with NA or 1 or both and the column with NA or 2 or both (I should keep the columns which have variation in their values)
I used this code but it didn't work properly:
mat_nonNA <- mat[, !apply((is.na(mat) | mat == 0) & (is.na(mat) |
mat==1) &(is.na(mat) | mat==2), 2, all)]
thanks for your help
If I understand your request correctly, you can try:
mat_nonNA <- mat[, apply(mat, 2, function(x){length(unique(x[!is.na(x)])) > 1})]
It looks for columns with more than 1 different non NA value.
The result is:
mat_nonNA
# X1.110590170 X1.110906406 X1.110993854 X1.111144756
#A05363 0 0 0 0
#A05370 0 0 0 NA
#A05380 1 2 0 0
#A05397 0 0 1 2
#A05400 2 0 0 0
#A05426 0 NA 0 0
I'm pretty sure you can do it directly, but it would be hard to understand such a complicated condition. Instead, try evaluating the conditions separately first:
zeroOrNA <- apply(is.na(mat) | mat == 0, 2, all)
oneOrNA <- apply(is.na(mat) | mat == 1, 2, all)
twoOrNA <- apply(is.na(mat) | mat == 2, 2, all)
mat_nonNA <- mat[ , !(zeroOrNA | oneOrNA | twoOrNA)]
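To see that the two approaches select the same columns, here is a small sketch on a made-up matrix (the original mat is not shown in the question, so the column names c1 to c5 and the values are invented for illustration). It also adds drop = FALSE, which keeps the result a matrix even when only one column survives.

```r
# Made-up example matrix with values NA, 0, 1, 2
mat <- cbind(
  c1 = c(0, 0, NA, 0),     # only 0 / NA -> remove
  c2 = c(1, NA, 1, 1),     # only 1 / NA -> remove
  c3 = c(0, 1, 2, NA),     # varies      -> keep
  c4 = c(2, 2, 2, 2),      # constant    -> remove
  c5 = c(NA, NA, NA, NA)   # all NA      -> remove
)

# Approach 1: keep columns with more than one distinct non-NA value
has_variation <- apply(mat, 2, function(x) length(unique(x[!is.na(x)])) > 1)

# Approach 2: evaluate the three conditions separately
zeroOrNA <- apply(is.na(mat) | mat == 0, 2, all)
oneOrNA  <- apply(is.na(mat) | mat == 1, 2, all)
twoOrNA  <- apply(is.na(mat) | mat == 2, 2, all)
keep2 <- !(zeroOrNA | oneOrNA | twoOrNA)

mat_nonNA <- mat[, has_variation, drop = FALSE]
```

The two selections agree here because every value is NA, 0, 1 or 2; the first approach would also handle other values without extra conditions.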