@BenBolker Here is an example of the output I would like; I have no idea if it can even be done.
CURRENT_FIX_START CURRENT_FIX_END identifier trialtype rotatedimaged    targetloc prefix corrfix errfix
7                 202             30         rotated   stimN11of2.jpg   left      231    254     0
7                 208             42         rotated   stimN221of2.jpg  left      451    245     0
241               761             78         rotated   stimW131-of2.jpg right     345    345     0
For each trial where there is a correct prefix (denoted by having a time stamp) and a correct corrfix, I would want the script to print everything on the same line (I need to get a latency measure from prefix to corrfix). In the old data, the time stamps occurred on different lines. I was thinking about doing this manually, but it would be far too time-consuming.
Untested, but how about:
latency <- with(mydata,abs(CURRENT_FIX_START-CURRENT_FIX_END))
for (i in c("prefix","corrfix","errfix")) {
mydata[[i]] <- ifelse(mydata[[i]]==1,latency,0)
}
You'll need to adjust for appropriate indexing, but this seems to work:
#fake data
dat <- data.frame(fix1 = runif(10), fix2 = runif(10), prefix = sample(0:1, 10, TRUE),
corfix = sample(0:1, 10, TRUE), errfix = sample(0:1, 10, TRUE))
dat[, 3:5] <- apply(dat[, 3:5], 2, function(x) ifelse(x == 1, abs(dat$fix1 - dat$fix2),x))
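The same replacement can be sketched without apply. This is only an illustration under assumptions: the data frame below is made up, and it assumes the three flag columns hold nothing but 0 and 1, so multiplying a flag by the latency leaves 0 where the flag is 0.

```r
# Hypothetical data: two time stamps and three 0/1 indicator columns
set.seed(42)
dat <- data.frame(fix1 = runif(5), fix2 = runif(5),
                  prefix = c(1, 0, 1, 0, 1),
                  corfix = c(0, 0, 1, 1, 1),
                  errfix = c(1, 0, 0, 0, 0))

# Latency is the absolute difference between the two time stamps
latency <- abs(dat$fix1 - dat$fix2)

# Replace each 1 with the row's latency; a 0 stays 0
flag_cols <- c("prefix", "corfix", "errfix")
dat[flag_cols] <- lapply(dat[flag_cols], function(flag) flag * latency)
dat
```

Selecting the columns by name rather than by position (3:5) avoids the indexing adjustment mentioned above.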
I have a dataframe where one column is the amount spent. In the amount spent column there are the values for amount spent and also negative values for any returns. For example.
ID Store Spent
123 A 18.50
123 A -18.50
123 A 18.50
I want to remove each negative value together with one of its positive counterparts. The idea is to keep only fully completed spend amounts so I can look at total spend.
Right now I am thinking something like this - where I have the data frame sorted by spend
if spend < 0 {
take absolute value of spend
if diff between abs(spend) and spend+1 = 0 then both are NA}
I would like to have something like
df[df$spend < 0] <- NA
where I can also set one positive counterpart to NA as well. Any suggestions?
There should be a simpler solution to this, but here is one way. I also created my own example, since the one shared did not have enough data points to test.
#Original vector
x <- c(1, 2, -2, 1, -1, -1, 2, 3, -4, 1, 4)
#Count the frequency of negative numbers, keeping all the unique numbers
#(levels are sorted so they line up position by position with table(abs(x)) below)
vals <- table(factor(abs(x[x < 0]), levels = sort(unique(abs(x)))))
#Count the frequency of absolute value of original vector
vals1 <- table(abs(x))
#Subtract the frequencies between two vectors
new_val <- vals1 - (vals * 2 )
#Recreate the new vector
as.integer(rep(names(new_val), new_val))
#[1] 1 2 3
If you add a rowid column, you can do this with data.table anti-joins.
Here's an example which takes ID into account, not deleting "positive counterparts" unless they're the same ID
First create more interesting sample data
library(data.table)
df <- fread('
ID Store Spent
123 A 18.50
123 A -18.50
123 A 18.50
123 A -19.50
123 A 19.50
123 A -99.50
124 A -94.50
124 A 99.50
124 A 94.50
124 A 94.50
')
Now remove all the negative values with positive counterparts, and remove those counterparts
negs <- df[Spent < 0][, Spent := -Spent][, rid := rowid(ID, Spent)]
pos <- df[Spent > 0][, rid := rowid(ID, Spent)]
pos[!negs, on = .(ID, Spent, rid), -'rid']
#     ID Store Spent
# 1: 123     A  18.5
# 2: 124     A  99.5
# 3: 124     A  94.5
And as applied to Ronak's x vector example
x <- c(1, 2, -2, 1, -1, -1, 2, 3, -4, 1, 4)
negs <- data.table(x = -x[x<0])[, rid := rowid(x)]
pos <- data.table(x = x[x>0])[, rid := rowid(x)]
pos[!negs, on = names(pos), -'rid']
# x
# 1: 2
# 2: 3
# 3: 1
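For readers without data.table, the same occurrence-counter idea can be sketched in base R. This is a rough equivalent under assumptions, not the answer's method: ave() with seq_along stands in for rowid(), and the key vectors pos_key and neg_key are names made up for this illustration.

```r
x <- c(1, 2, -2, 1, -1, -1, 2, 3, -4, 1, 4)

pos <- x[x > 0]
neg <- -x[x < 0]   # negatives, flipped to their positive counterparts

# Number each occurrence within its value (1st "1", 2nd "1", ...),
# then build a "value occurrence" key, the base R analogue of joining on (x, rid)
pos_key <- paste(pos, ave(seq_along(pos), pos, FUN = seq_along))
neg_key <- paste(neg, ave(seq_along(neg), neg, FUN = seq_along))

# Keep only positives whose key has no matching negative
remaining <- pos[!pos_key %in% neg_key]
remaining
# [1] 2 3 1
```

Each negative cancels exactly one positive of the same magnitude, so the third 1 survives while the single 4 does not.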
I used the following code.
library(dplyr)
store <- rep(LETTERS[1:3], 3)
id <- c(1:4, 1:3, 1:2)
expense <- runif(9, -10, 10)
tibble(store, id, expense) %>%
group_by(store) %>%
summarise(net_expenditure = sum(expense))
to get this output:
# A tibble: 3 x 2
store net_expenditure
<chr> <dbl>
1 A 13.3
2 B 8.17
3 C 16.6
Alternatively, if you wanted the net expenditure per store-id pairing, then you could use this code:
tibble(store, id, expense) %>%
group_by(store, id) %>%
summarise(net_expenditure = sum(expense))
I've approached your question from a slightly different perspective. I'm not sure that my code answers your question, but it might help.
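The net-spend idea can also be sketched in base R on the three rows from the question. The column names are taken from the question's table; everything else here is an assumption for illustration.

```r
# The example rows from the question
df <- data.frame(ID = c(123, 123, 123),
                 Store = c("A", "A", "A"),
                 Spent = c(18.50, -18.50, 18.50))

# Summing per ID and Store lets each return cancel one purchase
net <- aggregate(Spent ~ ID + Store, data = df, FUN = sum)
net
```

This gives the total per group only; it does not tell you which individual rows were cancelled, which the anti-join approach does.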
In the example we have:
   treetype Leaves roots
1.        1    670    25
2.        4    330    55
3.        8    880    55
4.        3    770    25
Wanted solution:
New value
1. 696 (1+670+25)
2. 389 (4+330+55)
3. and so on
And a second question:
If the value doesn't meet a required value, e.g. 550, then calculate the new value using only Leaves.
-> The 2nd row in the wanted solution should then be 330.
Any tips?
This is a task which is suitable for the dplyr package, which is part of tidyverse:
library(tidyverse)
data <- data.frame(treetype = c(1,4,8,3),
Leaves = c(670, 330, 880, 770),
roots = c(25,55,55,25))
limit <- 550
data <- data %>%
dplyr::mutate(New_value = ifelse(treetype+Leaves+roots < limit, Leaves, treetype+Leaves+roots))
As an alternative, here is a solution using base R (no extra packages).
data <- data.frame(treetype = c(1,4,8,3),
Leaves = c(670, 330, 880, 770),
roots = c(25,55,55,25))
limit <- 550
data$limit_test <- data$treetype + data$Leaves + data$roots > limit
data$New_value[data$limit_test] <- (data$treetype + data$Leaves + data$roots)[data$limit_test]
data$New_value[!data$limit_test] <- (data$Leaves)[!data$limit_test]
data$limit_test <- NULL
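A more compact base R variant is sketched below. It assumes the three columns are numeric and uses the same rule as the dplyr version (use Leaves when the total is below the limit); the helper vector total is a name introduced here.

```r
data <- data.frame(treetype = c(1, 4, 8, 3),
                   Leaves = c(670, 330, 880, 770),
                   roots = c(25, 55, 55, 25))
limit <- 550

# Row-wise total of the three columns
total <- rowSums(data[c("treetype", "Leaves", "roots")])

# Fall back to Leaves alone when the total is below the limit
data$New_value <- ifelse(total < limit, data$Leaves, total)
data$New_value
# 696 330 913 798
```

Computing the total once avoids repeating the sum in both branches of ifelse.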
I have a question about ifelse in data.frame in R. I checked several SO posts about it, and unfortunately none of these solutions fitted my case.
My case is a conditional calculation in a data frame, but it returns the warning "the condition has length > 1 and only the first element will be used" even though I used the ifelse function in R, which should work according to the SO posts I checked.
Here is my sample code:
library(scales)
head(temp[, 2:3])
previous current
1 0 10
2 50 57
3 92 177
4 84 153
5 30 68
6 162 341
temp$change = ifelse(temp$previous > 0, rate(temp$previous, temp$current), temp$current)
rate = function(yest, tod){
value = tod/yest
if(value>1){
return(paste("+", percent(value-1), sep = ""))
}
else{
return(paste("-", percent(1-value), sep = ""))
}
}
So if I run the ifelse one, I will get following result:
head(temp[, 2:4])
previous current change
1 0 10 10
2 50 57 +NaN%
3 92 177 +NaN%
4 84 153 +NaN%
5 30 68 +NaN%
6 162 341 +NaN%
So my question is, how should I deal with it? I tried to assign 0 to the last column before I run ifelse, but it still failed.
Many thanks in advance!
Try the following two segments; both should do what you wanted. Maybe the second one is what you are looking for.
library(scales)
set.seed(1)
temp <- data.frame(previous = rnorm(5), current = rnorm(5))
rate <- function(i) {
yest <- temp$previous[i]
tod <- temp$current[i]
if (yest <= 0)
return(tod)
value = tod/yest
if (value>1) {
return(paste("+", percent(value-1), sep = ""))
} else {
return(paste("-", percent(1-value), sep = ""))
}
}
temp$change <- unlist(lapply(1:dim(temp)[1], rate))
Second:
ind <- which(temp$previous > 0)
temp$change <- temp$current
temp$change[ind] <- unlist(lapply(ind,
function(i) rate(temp$previous[i], temp$current[i])))
In the second segment, the function rate is the same as you coded it.
Here's another way to do the same
# 1: load dplyr
#if needed install.packages("dplyr")
library(dplyr)
# 2: I recreate your data
your_dataframe = tibble(previous = c(0,50,92,84,30,162),
                        current = c(10,57,177,153,68,341))
# 3: obtain the change using your conditions
your_dataframe %>%
mutate(change = ifelse(previous > 0,
ifelse(current/previous > 1,
paste0("+%", (current/previous-1)*100),
paste0("-%", (1-current/previous)*100)),
current))
Result:
# A tibble: 6 x 3
previous current change
<dbl> <dbl> <chr>
1 0 10 10
2 50 57 +%14
3 92 177 +%92.3913043478261
4 84 153 +%82.1428571428571
5 30 68 +%126.666666666667
6 162 341 +%110.493827160494
Inside rate, if (value > 1) evaluates only the first element of value, so the output of rate depends solely on the first row of temp.
Adopting the advice I received from warm-hearted SO users, I vectorized some of my functions and it worked! Raise a glass to the SO community!
Here is the solution:
temp$rate = ifelse(temp$previous > 0, ifelse(temp$current/temp$previous > 1,
temp$current/temp$previous - 1,
1 - temp$current/temp$previous),
temp$current)
This will return rate with scientific notation. If "regular" notation is needed, here is an update:
temp$rate = format(temp$rate, scientific = F)
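If the signed percent labels from the original rate function are still wanted, a vectorized version could look like the sketch below. It is an assumption-laden illustration: it uses sprintf from base R instead of scales::percent, rounds to one decimal place, and the helper names prefix_sign and has_prev are made up here.

```r
temp <- data.frame(previous = c(0, 50, 92, 84, 30, 162),
                   current = c(10, 57, 177, 153, 68, 341))

# Vectorized: ifelse picks the sign for every element, not just the first
rate <- function(yest, tod) {
  value <- tod / yest
  prefix_sign <- ifelse(value > 1, "+", "-")
  paste0(prefix_sign, sprintf("%.1f%%", abs(value - 1) * 100))
}

# Default to the current value, then overwrite rows that have a previous value
temp$change <- as.character(temp$current)
has_prev <- temp$previous > 0
temp$change[has_prev] <- rate(temp$previous[has_prev], temp$current[has_prev])
temp
```

Calling rate only on the rows where previous > 0 avoids dividing by zero, which is where the NaN and Inf values came from.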
I am struggling to iterate 2 loops over all the files in a folder. I have over 600 .csv files, which contain information about the latency and duration of saccades made in a sentence. They look like this:
order subject sentence latency duration
1 1 1 641 76
2 1 1 98 57
3 1 1 252 49
4 1 1 229 43
For each of the files, I want to create 2 new columns called Start and End, to calculate the start and end point of each saccade. The values in each of those are calculated from the values in the latency and duration columns. I can do this using a loop for each file, like so:
SentFile = read.csv(file.choose(), header = TRUE, sep = ",")
# Calculate Start
for (i in 1:(nrow(SentFile)-1)){
SentFile$Start[1] = SentFile$Latency[1]
SentFile$Start[i+1] = SentFile$Start[i] +
SentFile$Duration[i] + SentFile$Latency[i+1]}
#Calculate End
for (i in 1:(nrow(SentFile)-1)){
SentFile$End[i] = SentFile$Start[i] + SentFile$Duration[i]}
And then the result looks like this:
order subject sentence latency duration Start End
1 1 1 641 76 641 717
2 1 1 98 57 815 872
3 1 1 252 49 1124 1173
4 1 1 229 43 1402 1445
I am sure there is probably a more efficient way of doing it, but it is very important to use the precise cells specified in the loop to calculate the Start and End values and that was the only way I could think of to get it to work for each individual file.
As I said, I have over 600 files, and I want to be able to calculate the Start and End values for the entire set and add the new columns to each file. I tried using lapply, like this:
sent_files = list.files()
lapply(sent_files, function(x){
SentFile = read.csv(x, header = TRUE, sep = ",")
for (i in 1:(nrow(SentFile)-1)){
SentFile$Start[1] = SentFile$Latency[1]
SentFile$Start[i+1] = SentFile$Start[i] + SentFile$Duration[i]
+ SentFile$Latency[i+1]}
#Calculate End of Saccade Absolute Time Stamp #######
for (i in 1:(nrow(SentFile)-1)){
SentFile$End[i] = SentFile$Start[i] + SentFile$Duration[i]}})
However, I keep getting this error message:
Error in `$<-.data.frame`(`*tmp*`, "SacStart", value = c(2934L, NA)):replacement has 2 rows, data has 1
I would really appreciate any help in getting this to work!
First, replace for loops:
data <- data.frame(
  order = c(1,2,3,4), subject = c(1,1,1,1), sentence = c(1,1,1,1),
  latency = c(641, 98, 252, 229), duration = c(76, 57, 49, 43)
)
data$end <- cumsum(data$latency + data$duration)
data$start <- data$end - data$duration
Secondly, you are not assigning the results of the CSV load to a variable in your environment.
If you want to process all files in one go, change the code for data load to this:
data.list <- lapply(sent_files, function(x){
data <- read.csv(x, header = TRUE, sep = ",")
return(data)
})
data <- do.call("rbind", data.list)
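To keep the files separate instead of binding them together, the two steps can be wrapped in one function and applied per file. This is a minimal sketch under assumptions: the function name add_start_end is hypothetical, the column names are assumed to be the lowercase latency and duration shown in the sample data, and the temporary file only stands in for one of the real .csv files.

```r
# Hypothetical helper: read one file and add the Start and End columns
add_start_end <- function(path) {
  sent <- read.csv(path, header = TRUE)
  sent$End <- cumsum(sent$latency + sent$duration)
  sent$Start <- sent$End - sent$duration
  sent
}

# Stand-in file built from the sample rows in the question
tmp <- tempfile(fileext = ".csv")
write.csv(data.frame(order = 1:4, subject = 1, sentence = 1,
                     latency = c(641, 98, 252, 229),
                     duration = c(76, 57, 49, 43)),
          tmp, row.names = FALSE)

result <- add_start_end(tmp)
result

# For the real folder, something along these lines (not run here):
# processed <- lapply(list.files(pattern = "\\.csv$"), add_start_end)
```

Because cumsum restarts for every call, each file gets its own running total, which matches the per-file loop in the question.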
I have a file with people's ages, and want to subset age ranges (e.g. under 10, 35-44, etc.).
Whilst age ranges of double digit numbers works fine using grep:
X_35_44 <- X[ grep("35|36|37|38|39|40|41|42|43|44", X$Age) , ]
When trying to subset for anything under 10 eg:
X_10under <- X[ grep("0|1|2|3|4|5|6|7|8|9|10|", X$Age) , ]
I am returned any age with a 1 in it (eg. 31) or a 2 or a 3, rather than just those numbers under 10.
How do I ensure that this doesn't happen?
Any help would be much appreciated!
Thanks in advance
Using the principle of not accepting failed code, but rather delivering a more effective coding solution, I'm going to disagree with the regex strategy and suggest you instead use cut or findInterval.
X <- data.frame(Ages = sample(1:85, 300, replace=TRUE))
X$age_cat <- cut(X$Ages, c(0, 10, 45, 60, 75, Inf), labels=c("under10",
'10-44','45-59','60-74','75+'), right=FALSE, include.lowest=TRUE)
head(X)
#=========
Ages age_cat
1 65 60-74
2 34 10-44
3 19 10-44
4 79 75+
5 5 under10
6 51 45-59
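The findInterval alternative mentioned above could look like the sketch below. The ages vector is made up for illustration, and the labels are the same ones used with cut.

```r
# Hypothetical ages, chosen to hit the boundaries
ages <- c(5, 9, 10, 34, 44, 45, 79)

# Lower bounds of each category; findInterval returns the index of the
# interval [breaks[i], breaks[i+1]) that each age falls into
breaks <- c(0, 10, 45, 60, 75)
labels <- c("under10", "10-44", "45-59", "60-74", "75+")

age_cat <- labels[findInterval(ages, breaks)]
age_cat
# [1] "under10" "under10" "10-44"   "10-44"   "10-44"   "45-59"   "75+"
```

Unlike cut, this returns a plain character vector rather than a factor, which may or may not be what you want for later grouping.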
A solution with ifelse():
df$age <- as.integer(df$age)
df$age_cat <- ifelse(df$age < 10, "age_0-10", ifelse(df$age < 20, "age_10-20", "age_20-"))
Choose your own ranges ...