I have a question about using ifelse() with a data.frame in R. I checked several SO posts about it, and unfortunately none of those solutions fit my case.
I want to make a conditional calculation in a data frame, but I still get the warning "the condition has length > 1 and only the first element will be used", even after switching to the ifelse() function, which should work according to the SO posts I checked.
Here is my sample code:
library(scales)
head(temp[, 2:3])
previous current
1 0 10
2 50 57
3 92 177
4 84 153
5 30 68
6 162 341
temp$change = ifelse(temp$previous > 0, rate(temp$previous, temp$current), temp$current)
rate = function(yest, tod){
  value = tod/yest
  if(value > 1){
    return(paste("+", percent(value - 1), sep = ""))
  }
  else{
    return(paste("-", percent(1 - value), sep = ""))
  }
}
So if I run the ifelse one, I will get following result:
head(temp[, 2:4])
previous current change
1 0 10 10
2 50 57 +NaN%
3 92 177 +NaN%
4 84 153 +NaN%
5 30 68 +NaN%
6 162 341 +NaN%
So my question is, how should I deal with this? I tried assigning 0 to the last column before running ifelse(), but it still failed.
Many thanks in advance!
Try the following two segments; both should do what you want. Maybe the second one is what you are looking for.
library(scales)
set.seed(1)
temp <- data.frame(previous = rnorm(5), current = rnorm(5))
rate <- function(i) {
  yest <- temp$previous[i]
  tod <- temp$current[i]
  if (yest <= 0)
    return(tod)
  value = tod/yest
  if (value > 1) {
    return(paste("+", percent(value - 1), sep = ""))
  } else {
    return(paste("-", percent(1 - value), sep = ""))
  }
}
temp$change <- unlist(lapply(1:dim(temp)[1], rate))
Second:
ind <- which(temp$previous > 0)
temp$change <- temp$current
temp$change[ind] <- unlist(lapply(ind,
function(i) rate(temp$previous[i], temp$current[i])))
In the second segment, the function rate is the same as you've coded it.
Here's another way to do the same
# 1: load dplyr
#if needed install.packages("dplyr")
library(dplyr)
# 2: I recreate your data
your_dataframe = as_tibble(cbind(c(0,50,92,84,30,162),
                                 c(10,57,177,153,68,341))) %>%
  rename(previous = V1, current = V2)
# 3: obtain the change using your conditions
your_dataframe %>%
  mutate(change = ifelse(previous > 0,
                         ifelse(current/previous > 1,
                                paste0("+%", (current/previous - 1)*100),
                                paste0("-%", (1 - current/previous)*100)),
                         current))
Result:
# A tibble: 6 x 3
previous current change
<dbl> <dbl> <chr>
1 0 10 10
2 50 57 +%14
3 92 177 +%92.3913043478261
4 84 153 +%82.1428571428571
5 30 68 +%126.666666666667
6 162 341 +%110.493827160494
Only the first element of value is evaluated, because if() is not vectorised. So the output of rate() depends solely on the first row of temp.
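Below is a minimal sketch of a vectorised variant (named rate_vec here just for the example), assuming the same column names and the percent() helper from scales as in the question; swapping if() for ifelse() makes the calculation element-wise:
library(scales)
# vectorised version of rate(): ifelse() evaluates the comparison for every
# element, so whole columns can be passed in at once
rate_vec <- function(yest, tod) {
  value <- tod / yest
  ifelse(value > 1,
         paste0("+", percent(value - 1)),
         paste0("-", percent(1 - value)))
}
temp$change <- ifelse(temp$previous > 0,
                      rate_vec(temp$previous, temp$current),
                      temp$current)
Note that the outer ifelse() still evaluates rate_vec() for rows where previous is 0 (giving Inf ratios), but those results are discarded in favour of temp$current.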
Adopting the advice I received from warm-hearted SO users, I vectorized my calculation and it worked! Raise a glass to the SO community!
Here is the solution:
temp$rate = ifelse(temp$previous > 0, ifelse(temp$current/temp$previous > 1,
temp$current/temp$previous - 1,
1 - temp$current/temp$previous),
temp$current)
This will return rate with scientific notation. If "regular" notation is needed, here is an update:
temp$rate = format(temp$rate, scientific = F)
Related
I've created a dynamic column name with dplyr::mutate() based on this thread: Use dynamic variable names in `dplyr`. Now I want to sort by the new column, but I'm not passing the column name correctly.
library(glue)
library(dplyr)
# data
set.seed(123)
df <- data.frame(distance = sample(1:100, size = 10))
# custom function
multiply_function <- function(df, metric, multiplier){
  df %>%
    mutate(., "{{metric}}_x{{multiplier}}" := {{metric}} * multiplier) %>%
    arrange(desc("{{metric}}_x{{multiplier}}")) # <--- this is not working
}
df %>%
multiply_function(., metric = distance, multiplier = 3)
distance distance_x3
1 31 93
2 79 237
3 51 153
4 14 42
5 67 201
6 42 126
7 50 150
8 43 129
9 97 291
10 25 75
Unfortunately I don't know of any way to use that nice glue syntax with anything that's not on the left side of a :=; that's where the magic happens. You can get something to work if you take care of the explicit conversion to symbol yourself and do the string building manually. It's not pretty, but this works:
multiply_function <- function(df, metric, multiplier){
  metric <- ensym(metric)  # capture the bare column name as a symbol
  newname <- glue::glue("{rlang::as_string(metric)}_x{as.character(multiplier)}")
  df %>%
    mutate("{newname}" := !!metric * multiplier) %>%
    arrange(desc(.data[[newname]]))  # refer to the new column by its string name
}
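For example, called on the df from the question, this version should return the rows ordered by the new column:
df %>%
  multiply_function(metric = distance, multiplier = 3)
# rows should now be sorted by distance_x3: 291, 237, 201, 153, 150, ...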
I'm not sure the arranging step is best placed within a function if that function is itself part of a pipe, and problems may emerge if you try to create several variables. For one new variable:
multiply_function <- function(df, metric, multiplier){
  df %>%
    mutate("{{metric}}_x{{multiplier}}" := {{metric}} * multiplier) %>%
    arrange(desc(!!rlang::sym(setdiff(names(.), names(df)))))
}
The following dataframe is a subset of a bigger df, which contains duplicated information
df<-data.frame(Caught=c(92,134,92,134),
Discarded=c(49,47,49,47),
Units=c(170,170,220,220),
Hours=c(72,72,72,72),
Colour=c("red","red","red","red"))
In Base R, I would like to get the following:
df_result <- data.frame(Caught = 226,
                        Discarded = 96,
                        Units = 390,
                        Hours = 72,
                        Colour = "red")
So basically the result is the sum of the unique values for the columns Caught, Discarded and Units, keeping the single value for Hours and Colour (Caught = 92 + 134, Discarded = 49 + 47, Units = 170 + 220, Hours = 72, Colour = "red").
However, I intend to do this in a much bigger data.frame with several columns. My idea was to apply a function based on column names as:
l <- lapply(df, function(x) {
  if(names(x) %in% c("Caught","Discarded","Units"))
    sum(unique(x))
  else
    unique(x)
})
as.data.frame(l)
However, this does not work, as I am not entirely sure how to get at the column names when using lapply() and similar functions.
I have also tried, without success, to use by() and apply().
Thanks
Since you're asking for base R:
l <- lapply( df, function(n) {
  if( is.numeric(n) )
    sum( unique(n) )
  else
    unique( n )
})
as.data.frame(l)
This solution takes advantage of the fact that data.frames are really just lists of vectors.
It produces this:
# Caught Discarded Units Hours Colour
# 226 96 390 72 red
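If you'd rather key off the column names (as in your attempt) rather than is.numeric(), you can iterate over names(df) so the name is available inside the function. A sketch using the columns from your question (cols_to_sum is just a helper name for this example):
cols_to_sum <- c("Caught", "Discarded", "Units")
l <- lapply(names(df), function(nm) {
  x <- df[[nm]]                 # look the column up by name
  if (nm %in% cols_to_sum) sum(unique(x)) else unique(x)
})
names(l) <- names(df)
as.data.frame(l)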
A suggestion:
df <-data.frame(Caught=c(92,134,92,134),
Discarded=c(49,47,49,47),
Units=c(170,170,220,220),
Hours=c(72,72,72,72),
Colour=c("red","red","red","red"))
df
#> Caught Discarded Units Hours Colour
#> 1 92 49 170 72 red
#> 2 134 47 170 72 red
#> 3 92 49 220 72 red
#> 4 134 47 220 72 red
df_results <- data.frame(Caught = sum(unique(df$Caught)),
Discarded = sum(unique(df$Discarded)),
Units = sum(unique(df$Units)),
Hours = unique(df$Hours),
Colour = unique(df$Colour))
df_results
#> Caught Discarded Units Hours Colour
#> 1 226 96 390 72 red
# Created on 2021-02-23 by the reprex package (v0.3.0.9001)
Let me preface this by saying that I'm new to R and still trying to get the fundamentals down.
Currently I'm working on a large dataframe (called "ppl") which I have to edit in order to filter some rows. Each row belongs to a group and is characterized by an intensity (into) value and a sample value.
mz rt into sample tracker sn grp
100.0153 126 2.762664 3 11908 7.522655 0
100.0171 127 2.972048 2 5308 7.718521 0
100.0788 272 30.217969 2 5309 19.024807 1
100.0796 272 17.277916 3 11910 7.297716 1
101.0042 128 37.557324 3 11916 27.991320 2
101.0043 128 39.676014 2 5316 28.234918 2
Well, the first question is: "How can I select from each group the sample with the highest intensity?"
I tried a for loop:
for (i in ppl$grp) {
  temp <- ppl[ppl$grp == i,]
  sel <- rbind(sel, temp[max(temp$into),])
}
It works for ppl$grp == 0, but the following iterations return rows of NAs.
The filtered dataframe (called "sel") should also store the sample values of the removed rows. It should look like this:
mz rt into sample tracker sn grp
100.0171 127 2.972048 c(2,3) 5308 7.718521 0
100.0788 272 30.217969 c(2,3) 5309 19.024807 1
101.0043 128 39.676014 c(2,3) 5316 28.234918 2
In order to get this I would use this approach:
lev<-factor(ppl$grp)
samp<-ppl$sample
samp2<-split(samp,lev)
sel$sample<-samp2
Any hint? I cannot test it since I haven't solved the previous problem yet.
Thanks a lot.
Not sure if I follow your question. But maybe this will get you started.
library(dplyr)
ppl %>% group_by(grp) %>% filter(into == max(into))
A base R option using ave is
ppl[with(ppl, ave(into, grp, FUN = max)==into),]
If the 'sample' column in the expected output has the unique elements of each 'grp', then after grouping by 'grp', update 'sample' to the pasted unique elements of 'sample', arrange 'into' in descending order, and slice the first row.
library(dplyr)
ppl %>%
  group_by(grp) %>%
  mutate(sample = toString(sort(unique(sample)))) %>%
  arrange(desc(into)) %>%
  slice(1L)
# mz rt into sample tracker sn grp
# <dbl> <int> <dbl> <chr> <int> <dbl> <int>
#1 100.0171 127 2.972048 2, 3 5308 7.718521 0
#2 100.0788 272 30.217969 2, 3 5309 19.024807 1
#3 101.0043 128 39.676014 2, 3 5316 28.234918 2
A data.table alternative:
library(data.table)
setkey(setDT(ppl),grp)
ppl <- ppl[ppl[,into==max(into),by=grp]$V1,]
## mz rt into sample tracker sn grp
##1: 100.0171 127 2.972048 2 5308 7.718521 0
##2: 100.0788 272 30.217969 2 5309 19.024807 1
##3: 101.0043 128 39.676014 2 5316 28.234918 2
I have no idea why this code would work
for (i in ppl$grp) {
  temp <- ppl[ppl$grp == i,]
  sel <- rbind(sel, temp[max(temp$into),])
}
max(temp$into) returns the maximum value itself, not its position, and that value is not an integer in most cases, so it makes no sense as a row index.
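For example, with the grp == 1 rows from the data above (assuming the data frame is called ppl, as in the question):
temp <- ppl[ppl$grp == 1, ]
max(temp$into)          # 30.217969, the maximum value, not a row number
temp[max(temp$into), ]  # asks for row 30 of a 2-row data frame, so you get a row of NAs
which.max(temp$into)    # 1, the row index you would actually want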
Also, building a data.frame with rbind on every iteration of a for loop is not good practice (in any language). It requires quite a bit of type checking and array growing that can get very expensive.
Also, max will return NA when there are any NAs for that group.
There is also the question of what you want to do about ties. Do you just want one result, or all of them? The code akrun gives will return all of them.
This code will write a new column that has the group max
ppl$grpmax <- ave(ppl$into, ppl$grp, FUN=function(x) { max(x, na.rm=TRUE ) } )
You can then select all values in a group that are equal to the max with
pplmax <- subset(ppl, into == grpmax)
If you want just one per group then you can remove duplicates
pplmax[!duplicated(pplmax$grp),]
I am struggling to run two loops over all the files in a folder. I have over 600 .csv files, which contain information about the latency and duration of saccades made in a sentence. They look like this:
order subject sentence latency duration
1 1 1 641 76
2 1 1 98 57
3 1 1 252 49
4 1 1 229 43
For each of the files, I want to create 2 new columns called Start and End, to calculate the start and end point of each saccade. The values in each of those are calculated from the values in the latency and duration columns. I can do this using a loop for each file, like so:
SentFile = read.csv(file.choose(), header = TRUE, sep = ",")
# Calculate Start
for (i in 1:(nrow(SentFile)-1)){
  SentFile$Start[1] = SentFile$Latency[1]
  SentFile$Start[i+1] = SentFile$Start[i] +
    SentFile$Duration[i] + SentFile$Latency[i+1]
}
# Calculate End
for (i in 1:(nrow(SentFile)-1)){
  SentFile$End[i] = SentFile$Start[i] + SentFile$Duration[i]
}
And then the result looks like this:
order subject sentence latency duration Start End
1 1 1 641 76 641 717
2 1 1 98 57 815 872
3 1 1 252 49 1124 1173
4 1 1 229 43 1402 1445
I am sure there is probably a more efficient way of doing it, but it is very important that the Start and End values are calculated from exactly the cells specified in the loop, and that was the only way I could think of to make it work for each individual file.
As I said, I have over 600 files, and I want to be able to calculate the Start and End values for the entire set and add the new columns to each file. I tried using lapply, like this:
sent_files = list.files()
lapply(sent_files, function(x){
  SentFile = read.csv(x, header = TRUE, sep = ",")
  for (i in 1:(nrow(SentFile)-1)){
    SentFile$Start[1] = SentFile$Latency[1]
    SentFile$Start[i+1] = SentFile$Start[i] + SentFile$Duration[i] +
      SentFile$Latency[i+1]
  }
  #Calculate End of Saccade Absolute Time Stamp #######
  for (i in 1:(nrow(SentFile)-1)){
    SentFile$End[i] = SentFile$Start[i] + SentFile$Duration[i]
  }
})
However, I keep getting this error message:
Error in `$<-.data.frame`(`*tmp*`, "SacStart", value = c(2934L, NA)):replacement has 2 rows, data has 1
I would really appreciate any help in getting this to work!
First, replace the for loops with vectorised arithmetic:
data <- data.frame(
  "order" = c(1,2,3,4), subject = c(1,1,1,1), sentence = c(1,1,1,1),
  latency = c(641, 98, 252, 229), duration = c(76, 57, 49, 43)
)
data$end <- cumsum(data$latency + data$duration)
data$start <- data$end - data$duration
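With the numbers from the question, this reproduces the expected Start and End values:
data$start
#> [1]  641  815 1124 1402
data$end
#> [1]  717  872 1173 1445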
Secondly, you are not assigning the result of read.csv() (or of your lapply() call) to anything in your environment, so the new columns are thrown away.
If you want to process all the files in one go, change the data-loading code to this:
data.list <- lapply(sent_files, function(x){
  data <- read.csv(x, header = TRUE, sep = ",")
  return(data)
})
data <- do.call("rbind", data.list)
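Putting the two pieces together, here is a sketch that computes Start and End separately for each file and keeps one processed data frame per input file (it assumes the columns are really named latency and duration, as in the printed data; adjust if your files use Latency/Duration):
sent_files <- list.files(pattern = "\\.csv$")
processed <- lapply(sent_files, function(x) {
  SentFile <- read.csv(x, header = TRUE, sep = ",")
  # each saccade ends at the running total of all latencies and durations so far,
  # and starts its own duration earlier than that
  SentFile$End   <- cumsum(SentFile$latency + SentFile$duration)
  SentFile$Start <- SentFile$End - SentFile$duration
  SentFile
})
names(processed) <- sent_files
If you need to save the results, you can then Map() over processed and a vector of output file names with write.csv().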
I'm using R to create an occupancy-model encounter history. I need to take a list of bird counts for individual leks, separate them by year, then code the count dates into two intervals: within 10 days of the first count (Interval 1), or more than 10 days after the first count (Interval 2). For any year where only one count occurred I need to add an entry coded as "U", to indicate that no count occurred during the second interval. Following that I need to subset out only the max count in each year and interval. A sample dataset:
ComplexId Date Males Year category
57 1941-04-15 97 1941 A
57 1942-04-15 67 1942 A
57 1943-04-15 44 1943 A
57 1944-04-15 32 1944 A
57 1946-04-15 21 1946 A
57 1947-04-15 45 1947 A
57 1948-04-15 67 1948 A
57 1989-03-21 25 1989 A
57 1989-03-30 41 1989 A
57 1989-04-13 2 1989 A
57 1991-03-06 35 1991 A
57 1991-04-04 43 1991 A
57 1991-04-11 37 1991 A
57 1991-04-22 25 1991 A
57 1993-03-23 6 1993 A
57 1994-03-06 17 1994 A
57 1994-03-11 10 1994 A
57 1994-04-06 36 1994 A
57 1994-04-15 29 1994 A
57 1994-04-21 27 1994 A
Now here is the code I wrote to accomplish my task, naming the dataframe above "c1" (you'll need to coerce the date column to date, and the category column to character):
c1_Year <- lapply(unique(c1$Year), function(x) c1[c1$Year == x,]) # splits complex counts into a list by year
for(i in 1:length(c1_Year)){
  c1_Year[[i]] <- cbind(c1_Year[[i]], daydiff = as.numeric(c1_Year[[i]][,2] - c1_Year[[i]][1,2]))
} # adds column with the difference between the first survey and subsequent surveys
for(i in 1:length(c1_Year)){
  c1_Year[[i]] <- if(length(c1_Year[[i]][,1]) == 1)
    rbind(c1_Year[[i]], c(c1_Year[[i]][1,1], NA, 0, c1_Year[[i]][1,4], "U", 11))
} # adds U values to years with only 1 count, while coercing the "U" into the appropriate interval
for(i in 1:length(c1_Year)){
  c1_Year[[i]]$Interval <- ifelse(c1_Year[[i]][,6] < 10, 1, 2)
} # adds interval code for each survey, 1 = less than ten days after first count, 2 = more than ten days after first count
for(i in 1:length(c1_Year)){
  c1_Year[[i]] <- ddply(.data = c1_Year[[i]], .(Interval), subset, Males == max(Males))
} # subsets out max count in each interval
The problem arises during the second for-loop, which with options(error=recover) enabled returns:
Error in c1_Year[[i]] : subscript out of bounds
No suitable frames for recover()
At that point the code has actually accomplished what it was supposed to: even though the error message is generated, the extra rows with the "U" code are still appended to the data frames. The issue is that I have 750 leks to do this for, so I tried to build the code above into a function; however, when I run the function on any data, the subscript out of bounds error stops it from running. I could brute-force it and just run the code above for each lek manually, but I was hoping there might be a more elegant solution. What I need to know is: why am I getting the subscript out of bounds error, and how can I fix it?
Here's the function I wrote, so that you can see that it doesn't work:
create.OEH <- function(dataset, final_dataframe){
  c1_Year <- lapply(unique(dataset$Year), function(x) dataset[dataset$Year == x,]) # splits complex counts into a list by year
  for(i in 1:length(c1_Year)){
    c1_Year[[i]] <- cbind(c1_Year[[i]], daydiff = as.numeric(c1_Year[[i]][,2] - c1_Year[[i]][1,2]))
  } # adds column with the difference between the first survey and subsequent surveys
  for(i in 1:length(c1_Year)){
    c1_Year[[i]] <- if(length(c1_Year[[i]][,1]) == 1)
      rbind(c1_Year[[i]], c(c1_Year[[i]][1,1], NA, 0, c1_Year[[i]][1,4], "U", 11))
  } # adds U values to years with only 1 count
  for(i in 1:length(c1_Year)){
    c1_Year[[i]]$Interval <- ifelse(c1_Year[[i]][,6] < 10, 1, 2)
  } # adds interval code for each survey, 1 = less than ten days after first count, 2 = more than ten days after first count
  for(i in 1:length(c1_Year)){
    c1_Year[[i]] <- ddply(.data = c1_Year[[i]], .(Interval), subset, Males == max(Males))
  } # subsets out max count for each interval
  df <- rbind.fill(c1_Year) # collapse list into a single dataframe
  final_dataframe <- df[!duplicated(df[,c("Year", "Interval")]),] # remove ties for max count
}
In this bit of code
for(i in 1:length(c1_Year)){
  c1_Year[[i]] <- if(length(c1_Year[[i]][,1]) == 1)
    rbind(c1_Year[[i]], c(c1_Year[[i]][1,1], NA, 0, c1_Year[[i]][1,4], "U", 11))
}
You are assigning NULL whenever length(c1_Year[[i]][,1]) == 1 is not true, which removes those elements from c1_Year entirely.
You probably want
for(i in 1:length(c1_Year)){
  if (length(c1_Year[[i]][,1]) == 1) {
    c1_Year[[i]] <- rbind(c1_Year[[i]], c(c1_Year[[i]][1,1], NA, 0, c1_Year[[i]][1,4], "U", 11))
  }
}
However, I see you are already using ddply, so you may be able to avoid a lot of the repetition.
ddply(c1, .(Year), ...) splits c1 up by unique year:
c2 <- ddply(c1,
            .(Year),
            function (x) {
              # create 'Interval'
              x$Interval <- ifelse(x$Date - x$Date[1] < 10, 1, 2)
              # extract max males per interval
              o <- ddply(x, .(Interval), subset, Males == max(Males))
              # add the 'U' row if no '2' interval
              if (all(o$Interval != 2)) {
                o <- rbind(o,
                           list(o$ComplexId, NA, 0, o$Year, 'U', 2))
              }
              # return the resulting dataframe
              o
            })
I converted your rbind(.., c(...)) to rbind(.., list(...)) to avoid converting everything back to string, which is what c() does because it cannot hold multiple different types.
Otherwise the code is almost the same as yours.
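A quick illustration of that coercion, using made-up values in the shape of one of your rows:
c(57, NA, 0, 1941, "U", 2)
#> [1] "57"   NA     "0"    "1941" "U"    "2"
# everything has been coerced to character, so the numeric columns become strings
str(list(57, NA, 0, 1941, "U", 2))
# a list keeps each element's own type: the numbers stay numeric and "U" stays character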