R: How to iterate loops over every file in a folder?

I am struggling to iterate 2 loops over all the files in a folder. I have over 600 .csv files, which contain information about the latency and duration of saccades made in a sentence. They look like this:
order subject sentence latency duration
1 1 1 641 76
2 1 1 98 57
3 1 1 252 49
4 1 1 229 43
For each of the files, I want to create 2 new columns called Start and End, to calculate the start and end point of each saccade. The values in each of those are calculated from the values in the latency and duration columns. I can do this using a loop for each file, like so:
SentFile = read.csv(file.choose(), header = TRUE, sep = ",")
# Calculate Start
for (i in 1:(nrow(SentFile)-1)){
  SentFile$Start[1] = SentFile$latency[1]
  SentFile$Start[i+1] = SentFile$Start[i] +
    SentFile$duration[i] + SentFile$latency[i+1]}
# Calculate End
for (i in 1:nrow(SentFile)){
  SentFile$End[i] = SentFile$Start[i] + SentFile$duration[i]}
And then the result looks like this:
order subject sentence latency duration Start End
1 1 1 641 76 641 717
2 1 1 98 57 815 872
3 1 1 252 49 1124 1173
4 1 1 229 43 1402 1445
I am sure there is probably a more efficient way of doing it, but it is very important to use the precise cells specified in the loop to calculate the Start and End values and that was the only way I could think of to get it to work for each individual file.
As I said, I have over 600 files, and I want to be able to calculate the Start and End values for the entire set and add the new columns to each file. I tried using lapply, like this:
sent_files = list.files()
lapply(sent_files, function(x){
  SentFile = read.csv(x, header = TRUE, sep = ",")
  # Calculate Start of Saccade Absolute Time Stamp
  for (i in 1:(nrow(SentFile)-1)){
    SentFile$Start[1] = SentFile$latency[1]
    SentFile$Start[i+1] = SentFile$Start[i] + SentFile$duration[i] +
      SentFile$latency[i+1]}
  # Calculate End of Saccade Absolute Time Stamp
  for (i in 1:nrow(SentFile)){
    SentFile$End[i] = SentFile$Start[i] + SentFile$duration[i]}})
However, I keep getting this error message:
Error in `$<-.data.frame`(`*tmp*`, "SacStart", value = c(2934L, NA)):replacement has 2 rows, data has 1
I would really appreciate any help in getting this to work!

First, replace for loops:
data <- data.frame(
  order = c(1, 2, 3, 4),
  subject = c(1, 1, 1, 1),
  sentence = c(1, 1, 1, 1),
  latency = c(641, 98, 252, 229),
  duration = c(76, 57, 49, 43)
)
data$end <- cumsum(data$latency + data$duration)
data$start <- data$end - data$duration
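The error in the question most likely comes from 1:(nrow(SentFile)-1): for a file with a single row this is 1:0, which counts down through 1 and 0, so the loop tries to write Start[2] into a one-row data frame. Below is a minimal sketch of that pitfall and of the vectorised calculation handling such a file. The helper name add_times and the one-row example (latency taken from the error message, duration invented) are assumptions for illustration only.

```r
# Hypothetical one-row file, the case that appears to trigger the error
one_row <- data.frame(latency = 2934, duration = 50)

bad_index  <- 1:(nrow(one_row) - 1)        # 1:0 counts down: 1, 0
good_index <- seq_len(nrow(one_row) - 1)   # empty, so a loop body never runs

# Hypothetical helper: vectorised Start/End, works for any number of rows
add_times <- function(df) {
  df$End   <- cumsum(df$latency + df$duration)
  df$Start <- df$End - df$duration
  df
}

one_row <- add_times(one_row)

# The four rows shown in the question
four_rows <- add_times(data.frame(
  latency  = c(641, 98, 252, 229),
  duration = c(76, 57, 49, 43)
))
```

If a loop is kept for other reasons, seq_len() is the safer way to build its index, because it yields an empty sequence instead of counting backwards.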
Second, the function you pass to lapply never returns the modified data frame, and the result of the CSV load is not assigned to a variable in your environment.
If you want to process all files in one go, change the data-loading code to this:
data.list <- lapply(sent_files, function(x){
data <- read.csv(x, header = TRUE, sep = ",")
return(data)
})
data <- do.call("rbind", data.list)
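If the goal is instead to add the new columns to each file separately, something along these lines may work. It is only a sketch: the names add_times and process_folder, the output folder, and the two demo files are all made up for illustration, and it assumes every file has lowercase latency and duration columns as in the sample shown in the question.

```r
# Hypothetical helper: add Start and End to one data frame
add_times <- function(df) {
  df$End   <- cumsum(df$latency + df$duration)
  df$Start <- df$End - df$duration
  df
}

# Hypothetical driver: read every .csv in `in_dir`, add the columns,
# and write the result under the same file name in `out_dir`
process_folder <- function(in_dir, out_dir) {
  dir.create(out_dir, showWarnings = FALSE, recursive = TRUE)
  files <- list.files(in_dir, pattern = "\\.csv$", full.names = TRUE)
  for (f in files) {
    df <- add_times(read.csv(f, header = TRUE))
    write.csv(df, file.path(out_dir, basename(f)), row.names = FALSE)
  }
  invisible(files)
}

# Demo on two throwaway files in a temporary folder
in_dir  <- file.path(tempdir(), "sent_in")
out_dir <- file.path(tempdir(), "sent_out")
dir.create(in_dir, showWarnings = FALSE)
write.csv(data.frame(order = 1:4, subject = 1, sentence = 1,
                     latency = c(641, 98, 252, 229),
                     duration = c(76, 57, 49, 43)),
          file.path(in_dir, "s1.csv"), row.names = FALSE)
write.csv(data.frame(order = 1, subject = 2, sentence = 1,
                     latency = 2934, duration = 50),
          file.path(in_dir, "s2.csv"), row.names = FALSE)

process_folder(in_dir, out_dir)
result <- read.csv(file.path(out_dir, "s1.csv"))
```

Writing to a separate folder keeps the original files untouched, which makes it easy to rerun the script if something needs to change.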

Related

R Functions for Binding Data Frames within a List that Have Same Column Headers

I am pulling individual logs that show changes in production tanks through an API. When trying to create one data frame with all of these logs, I have been running into various issues. Below is a section of my code:
Binded_TL<-do.call(rbind,TL_JSON_TEXT1)
TL_JSON <-purrr::map(Binded_TL, jsonlite::fromJSON)
TL_JSON2 <- TL_JSON[[1]]$Data
I have no issue with the above code: TL_JSON2 prints as a data frame with the correct headers. But when I put that step in a for loop to try to combine them all:
for (i in 1:length(TL_JSON_TEXT1)){
TL_JSON2[[i]] <- as.data.frame(TL_JSON[[i]]$Data)
}
that is where I run into an issue. I am not sure whether a for loop is the way to go or whether I should be doing something completely different.
I have tried the following:
TL_JSON2 <- data.frame()
for (i in 1:length(TL_JSON)){
TL_JSON2[[i]] <- paste0(TL_JSON[[i]]$Data)}
But I get the error "replacement has 43 rows, data has 0".
Reproducible code
tank1 <- data.frame(TankName = c("tank1", "tank1", "tank1"), Capacity = c(100,100,100), PercentFull = c(10,13,20), Date = c("1/2/22", "1/3/22", "1/5/22"))
tank2 <- data.frame(TankName = c("tank2"), Capacity = c(200), PercentFull= c(50), Date = c("2/7/22"))
tank3 <- data.frame(TankName = c("tank3", "tank3"), Capacity = c(300, 300), PercentFull = c(80, 60), Date = c("1/3/22","1/6/22"))
Nested_DF <- list(tank1, tank2, tank3)
I have something similar to the Nested_DF and I am trying to create a combined df that looks like
TankName Capacity PercentFull Date
1 tank1 100 10 1/2/22
2 tank1 100 13 1/3/22
3 tank1 100 20 1/5/22
4 tank2 200 50 2/7/22
5 tank3 300 80 1/3/22
6 tank3 300 60 1/6/22
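One possible approach for the reproducible example, sketched in base R and assuming every data frame in the list has the same column names and compatible column types, is to pass the whole list to rbind through do.call. The name combined is made up for illustration.

```r
tank1 <- data.frame(TankName = c("tank1", "tank1", "tank1"),
                    Capacity = c(100, 100, 100),
                    PercentFull = c(10, 13, 20),
                    Date = c("1/2/22", "1/3/22", "1/5/22"))
tank2 <- data.frame(TankName = "tank2", Capacity = 200,
                    PercentFull = 50, Date = "2/7/22")
tank3 <- data.frame(TankName = c("tank3", "tank3"),
                    Capacity = c(300, 300),
                    PercentFull = c(80, 60),
                    Date = c("1/3/22", "1/6/22"))
Nested_DF <- list(tank1, tank2, tank3)

# Stack all list elements row-wise into one data frame
combined <- do.call(rbind, Nested_DF)
rownames(combined) <- NULL   # make sure the row names run 1..n
```

For the API data, the same idea would presumably need the Data element extracted from each parsed log first, for example with lapply(TL_JSON, function(x) as.data.frame(x$Data)). That step is untested here because the real structure of the logs is not shown.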

h2o deep learning NN 1 layer non reproducing X + Y = Z

I have just started using the h2o package to build a supervised neural network with its deep learning function (h2o.deeplearning).
To get on track, I started by trying to simulate a function like X + Y = Z.
My code is the following:
data <- read.table("DeepLearningTest.csv", header = FALSE, sep = ",", quote = "", stringsAsFactor = F)
test <- read.table("DeepLearningTestRun.csv", header = FALSE, sep = ",", quote = "", stringsAsFactor = F)
df <- data.frame (data)
tf <- data.frame (test)
h2o.init ()
hf <- as.h2o (df)
m2 <- h2o.deeplearning(
training_frame=hf,
x=0:1,
y='C',
hidden = c(100),
epochs=100000,
stopping_tolerance=0.001
)
h2o.predict (m2, as.h2o(tf))
Where my test samples are the following (for example):
1 1 2
2 2 4
3 3 6
4 4 8
. . .
until
2000 2000 4000
In general, it is X + X = 2X.
What I do not understand, and the reason I am writing, is this: if I use a one-layer network (which, by the universal approximation theorem, should be sufficient),
I can train the network and then predict quite good values within the range of the training data.
For instance, the network gives me
100 100 200
101 101 202
7 7 14
but when I put in
4000 4000
the result is wrong. It gives me something like 6300.
It seems that the network is not able to generalize.
What am I doing wrong to cause this behavior?
Thank you for your attention.
Regards,
Nicola

ifelse didn't work in dataframe in R

I have a question about ifelse in a data.frame in R. I checked several SO posts about it, and unfortunately none of those solutions fit my case.
My case is a conditional calculation in a data frame, but R reports "the condition has length > 1 and only the first element will be used" even though I used the ifelse function, which should work according to the SO posts I checked.
Here is my sample code:
library(scales)
head(temp[, 2:3])
previous current
1 0 10
2 50 57
3 92 177
4 84 153
5 30 68
6 162 341
temp$change = ifelse(temp$previous > 0, rate(temp$previous, temp$current), temp$current)
rate = function(yest, tod){
value = tod/yest
if(value>1){
return(paste("+", percent(value-1), sep = ""))
}
else{
return(paste("-", percent(1-value), sep = ""))
}
}
So if I run the ifelse one, I will get following result:
head(temp[, 2:4])
previous current change
1 0 10 10
2 50 57 +NaN%
3 92 177 +NaN%
4 84 153 +NaN%
5 30 68 +NaN%
6 162 341 +NaN%
So my question is, how should I deal with it? I tried to assign 0 to the last column before I run ifelse, but it still failed.
Many thanks in advance!
Try the following two segments; both should do what you want. The second one may be what you are looking for.
library(scales)
set.seed(1)
temp <- data.frame(previous = rnorm(5), current = rnorm(5))
rate <- function(i) {
yest <- temp$previous[i]
tod <- temp$current[i]
if (yest <= 0)
return(tod)
value = tod/yest
if (value>1) {
return(paste("+", percent(value-1), sep = ""))
} else {
return(paste("-", percent(1-value), sep = ""))
}
}
temp$change <- unlist(lapply(1:dim(temp)[1], rate))
Second:
ind <- which(temp$previous > 0)
temp$change <- temp$current
temp$change[ind] <- unlist(lapply(ind,
function(i) rate(temp$previous[i], temp$current[i])))
In the second segment, the function rate is same as you've coded it.
Here's another way to do the same
# 1: load dplyr
# if needed: install.packages("dplyr")
library(dplyr)
# 2: recreate your data
your_dataframe = tibble(previous = c(0, 50, 92, 84, 30, 162),
                        current = c(10, 57, 177, 153, 68, 341))
# 3: obtain the change using your conditions
your_dataframe %>%
  mutate(change = ifelse(previous > 0,
                         ifelse(current/previous > 1,
                                paste0("+%", (current/previous - 1)*100),
                                paste0("-%", (1 - current/previous)*100)),
                         current))
Result:
# A tibble: 6 x 3
previous current change
<dbl> <dbl> <chr>
1 0 10 10
2 50 57 +%14
3 92 177 +%92.3913043478261
4 84 153 +%82.1428571428571
5 30 68 +%126.666666666667
6 162 341 +%110.493827160494
The if statement is not vectorised, so only the first element of value is evaluated. The output of rate therefore depends solely on the first row of temp.
Adopting the advice I received from warm-hearted SO users, I vectorized some of my functions and it worked! Raise a glass to SO community!
Here is the solution:
temp$rate = ifelse(temp$previous > 0, ifelse(temp$current/temp$previous > 1,
temp$current/temp$previous - 1,
1 - temp$current/temp$previous),
temp$current)
This will return rate with scientific notation. If "regular" notation is needed, here is an update:
temp$rate = format(temp$rate, scientific = F)
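If the signed percentage strings from the original rate function are still wanted, a vectorised variant could look like the sketch below. It uses base R sprintf instead of scales::percent, so the formatting (one decimal place, and "+0.0%" when nothing changes) is an assumption rather than a reproduction of the original output. The name rate_vec is made up for illustration.

```r
temp <- data.frame(previous = c(0, 50, 92, 84, 30, 162),
                   current  = c(10, 57, 177, 153, 68, 341))

# Hypothetical vectorised replacement for rate(): takes whole columns
rate_vec <- function(yest, tod) {
  change <- (tod / yest - 1) * 100
  sprintf("%+.1f%%", change)   # the "+" flag prints the sign explicitly
}

# Rows with previous == 0 keep the current value, as in the question
temp$change <- ifelse(temp$previous > 0,
                      rate_vec(temp$previous, temp$current),
                      as.character(temp$current))
```

Because rate_vec works on entire vectors, ifelse can pick element by element between its result and the fallback value without any explicit loop.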

R : Create specific bin based on data range

I am attempting to repeatedly add a "fixed number" to a numeric vector depending on a specified bin size. However, the "fixed number" is dependent on the data range.
For instance, I have a data range of 10 to 1010, and I wish to separate the data into 100 bins:
Since 1010 - 10 = 1000
And 1000 / 100(The number of bin specified) = 10
Therefore the ideal data would look like this
bin1 - 10 (initial data)
bin2 - 20 (initial data + 10)
bin3 - 30 (initial data + 20)
bin4 - 40 (initial data + 30)
bin100 - 1010 (initial data + 1000)
Now the real data is slightly more complex, there is not just one data range but multiple data range, hopefully the example below would clarify
# Some fixed values
start <- c(10, 5000, 4857694)
end <- c(1010, 6500, 4897909)
Ideally I wish to get something like
10 20
20 30
30 40
.. ..
5000 5015
5015 5030
5030 5045
.. ..
4857694 4858096 # Note theoretically it would have decimal places,
#but i do not want any decimal place
4858096 4858498
.. ..
So far I was thinking along this kind of function, but it seems inefficient because ;
1) I have to retype the function 100 times (because my number of bin is 100)
2) I can't find a way to repeat the function along my values - In other words my function can only deal with the data 10-1010 and not the next one 5000-6500
# The range of the variable
width <- end - start
# The bin size (Number of required bin)
bin_size <- 100
bin_count <- width/bin_size
# Create a function
f1 <- function(x,y){
c(x[1],
x[1] + y[1],
x[1] + y[1]*2,
x[1] + y[1]*3)
}
f1(x= start,y=bin_count)
f1
[1] 10 20 30 40
Perhaps any hint or ideas would be greatly appreciated. Thanks in advance!
After a few hours of trying, I managed to answer my own question, so I thought I would share it. I used the function bins from the package binr to get the required bins. Below is my attempt; its output is slightly different from the intended output, but for my purpose it is still okay.
library(binr)
# Some fixed values
start <- c(10, 5000, 4857694)
end <- c(1010, 6500, 4897909)
tmp_list_start <- list() # Create an empty list
# This just extract the output from "bins" function into a list
for (i in seq_along(start)){
tmp <- bins(start[i]:end[i],target.bins = 100,max.breaks = 100)
# Now i need to convert one of the output from bins into numeric value
s <- gsub(",.*", "", names(tmp$binct))
s <- gsub("\\[","",s)
tmp_list_start[[i]] <- as.numeric(s)
}
# Repeating the same thing with slight modification to get the end value of the bin
tmp_list_end <- list()
for (i in seq_along(end)){
tmp <- bins(start[i]:end[i],target.bins = 100,max.breaks = 100)
e <- gsub(".*,", "", names(tmp$binct))
e <- gsub("]","",e)
tmp_list_end[[i]] <- as.numeric(e)
}
v1 <- unlist(tmp_list_start)
v2 <- unlist(tmp_list_end)
df <- data.frame(start=v1, end=v2)
head(df)
start end
1 10 20
2 21 30
3 31 40
4 41 50
5 51 60
6 61 70
Pardon my crappy code, Please share if there is a better way of doing this. Would be nice if someone could comment on how to wrap this into a function..
Here's a way that may help with base R:
bin_it <- function(START, END, BINS = 100) {
  range <- END - START
  jump <- range / BINS
  v1 <- c(START, seq(START + jump + 1, END, jump))
  v2 <- seq(START + jump - 1, END, jump) + 1
  data.frame(v1, v2)
}
It uses the function seq to create the vectors of numbers leading to the ending number. It may not work for every case, but for the ranges you gave it should give the desired output.
bin_it(10, 1010)
v1 v2
1 10 20
2 21 30
3 31 40
4 41 50
5 51 60
bin_it(5000, 6500)
v1 v2
1 5000 5015
2 5016 5030
3 5031 5045
4 5046 5060
5 5061 5075
bin_it(4857694, 4897909)
v1 v2
1 4857694 4858096
2 4858097 4858498
3 4858499 4858900
4 4858901 4859303
5 4859304 4859705
6 4859706 4860107
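If contiguous bins are preferred, as in the layout sketched in the question (10 to 20, 20 to 30, and so on), another option might be to compute the bin edges once with seq and length.out, round them to whole numbers, and pair neighbouring edges. This is only a sketch; the names make_bins and all_bins are invented for illustration.

```r
# Hypothetical helper: `bins` contiguous bins between `from` and `to`,
# with edges rounded to whole numbers
make_bins <- function(from, to, bins = 100) {
  edges <- round(seq(from, to, length.out = bins + 1))
  data.frame(start = edges[-length(edges)], end = edges[-1])
}

start <- c(10, 5000, 4857694)
end   <- c(1010, 6500, 4897909)

# One data frame per range, stacked into a single table
all_bins <- do.call(rbind, Map(make_bins, start, end))
```

Map calls make_bins once per pair of start and end values, so adding more ranges only means extending the two vectors.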

Replacing Values in R with Latency Values

@BenBolker Here is an example of the output I would like; I have no idea if it can even be done.
CURRENT_FIX_START CURRENT_FIX_END identifier trialtype rotatedimaged targetloc prefix corrfix errfix
7 202 30 rotated stimN11of2.jpg left 231 254 0
7 208 42 rotated stimN221of2.jpg left 451 245 0
241 761 78 rotated stimW131-of2.jpg right 345 345 0
For each trial where there is a correct prefix (denoted by a time stamp) and a correct corrfix, I want the script to print everything on the same line, because I need to get a latency measure from prefix to corrfix. In the old data, the time stamps occurred on different lines. I was thinking about doing this manually, but it would be far too time-consuming.
Untested, but how about:
latency <- with(mydata,abs(CURRENT_FIX_START-CURRENT_FIX_END))
for (i in c("prefix","corrfix","errfix")) {
mydata[[i]] <- ifelse(mydata[[i]]==1,latency,0)
}
You'll need to adjust for appropriate indexing, but this seems to work:
#fake data
dat <- data.frame(fix1 = runif(10), fix2 = runif(10), prefix = sample(0:1, 10, TRUE),
corfix = sample(0:1, 10, TRUE), errfix = sample(0:1, 10, TRUE))
dat[, 3:5] <- apply(dat[, 3:5], 2, function(x) ifelse(x == 1, abs(dat$fix1 - dat$fix2),x))

Resources