Reducing data in a data frame to plot data in R

I'm very new to programming so I apologise in advance for my lack of R know-how. I'm a PhD student interested in pupillometry and I have just recorded the pupil response of participants performing a listening task in two conditions (Easy and Hard). The pupil response interest period for each trial is around 20 seconds and I would like to be able to plot this data for each participant in R. The eyetracker sampling rate is 1000Hz and each participant completed 92 trials, so the data that I currently have for each participant runs to close to 2 million rows. I have tried to plot this using ggplot2 but, as expected, the graph is very cluttered.
I've been trying to work out a way of reducing the data so that I can plot it on R. Ideally, I would like to take the mean pupil size value for every 1000 samples (i.e. 1 second of recording) averaged across all 92 trials for each participant. With this information, I would then create a new dataframe for plotting the average slope from 1-20 seconds for the two listening conditions (Easy and Hard).
Here is the current structure of my data frame:
> str(ppt53data)
'data.frame': 1915391 obs. of 6 variables:
$ RECORDING_SESSION_LABEL: Factor w/ 1 level "ppt53": 1 1 1 1 1 1 1 1 1 1 ...
$ listening_condition : Factor w/ 2 levels "Easy","Hard": 2 2 2 2 2 2 2 2 2 2 ...
$ RIGHT_PUPIL_SIZE : Factor w/ 3690 levels ".","0.00","1000.00",..: 3266 3264 3263 3262 3262 3260 3257 3254 3252 3252 ...
$ TIMESTAMP : num 262587 262588 262589 262590 262591 ...
$ TRIAL_START_TIME : int 262587 262587 262587 262587 262587 262587 262587 262587 262587 262587 ...
$ TrialTime : num 0 1 2 3 4 5 6 7 8 9 ...
- attr(*, "na.action")=Class 'omit' Named int [1:278344] 873 874 875 876 877 878 879 880 881 882 ...
.. ..- attr(*, "names")= chr [1:278344] "873" "874" "875" "876" ...
The 'TrialTime' variable specifies the sample (i.e. millisecond) in each trial. Can anyone advise me about which step I should take next? I figure it would make sense to arrange my data into separate data frames which would allow me to calculate the mean values that I want (across trials and for every 1000 samples). However, I'm not sure what is the most efficient/best way of doing this.
I'm sorry that I can't be any more specific. Any rough guidance would be much appreciated.

I think for such a large block of data with many aggregation levels you will need to use data.table. I may have mis-structured your data, but hopefully this will give you the idea:
require(data.table)
require(ggplot2)
# 100 patients * 20,000 observations (1-20,000 ms)
ppt53data<-data.frame(
RECORDING_SESSION_LABEL=paste0("pat-",rep(1:100,each=20000)), #patients
listening_condition=sample(c("Easy","Hard"),2000000,replace=T), #Easy/Hard
RIGHT_PUPIL_SIZE=rnorm(2000000,3000,500), #Pupil Size
TrialTime=rep(1:20000,100) #ms from start
)
# group in 1000ms blocks
ppt53data$group<-cut(ppt53data$TrialTime,c(0,seq(1000,20000,1000),Inf))
unique(ppt53data$group)
#convert frame to table
dt.ppt53data<-data.table(ppt53data)
#index
setkey(dt.ppt53data, RECORDING_SESSION_LABEL, group)
#create data.frame of aggregated plot data
plot.data<-data.frame(dt.ppt53data[,list(RIGHT_PUPIL_SIZE=mean(RIGHT_PUPIL_SIZE)),by=list(group)])
#plot with ggplot2
ggplot(plot.data) + geom_bar(aes(group, RIGHT_PUPIL_SIZE, fill=group), stat="identity") +
theme(axis.text.x=element_text(angle=-90)) +
coord_cartesian(ylim=c(2995,3005))
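The bar chart above averages over everything; as a rough sketch (not from the original answer, and reusing the simulated dt.ppt53data from above), the aggregation the question actually asks for keeps listening_condition as a grouping variable, which can then be drawn as one line per condition:
# mean pupil size per listening condition and 1000 ms block
plot.data2 <- data.frame(dt.ppt53data[,list(RIGHT_PUPIL_SIZE=mean(RIGHT_PUPIL_SIZE)),by=list(listening_condition,group)])
# one line per condition across the twenty 1-second blocks
ggplot(plot.data2, aes(group, RIGHT_PUPIL_SIZE, colour=listening_condition, group=listening_condition)) +
geom_line() +
theme(axis.text.x=element_text(angle=-90))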

Some rough guidance:
library(plyr)
ppt53data.summarized <- ddply(ppt53data, .(TrialTime), summarize, mean = mean(RIGHT_PUPIL_SIZE))
This tells it to calculate the mean size of the right pupil for each unique TrialTime. Perhaps seeing how this works would help you figure out how to describe what you need?
Assuming that within each TrialTime there are more than 1000 observations, you can randomly select:
set.seed(42)
ppt53data.summarized <- ddply(ppt53data, .(TrialTime), summarize, mean = mean(sample(RIGHT_PUPIL_SIZE,1000)))
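Building on the same ddply pattern, here is a rough sketch of the 1-second binning described in the question; it assumes RIGHT_PUPIL_SIZE is first converted from factor to numeric (the "." level becomes NA):
library(plyr)
# convert the factor column to numeric; "." samples become NA
ppt53data$RIGHT_PUPIL_SIZE <- as.numeric(as.character(ppt53data$RIGHT_PUPIL_SIZE))
# label each sample with the second of the trial it falls in (1-20)
ppt53data$Second <- floor(ppt53data$TrialTime / 1000) + 1
# mean pupil size per listening condition and second, averaged over all trials
ppt53data.binned <- ddply(ppt53data, .(listening_condition, Second), summarize,
mean_pupil = mean(RIGHT_PUPIL_SIZE, na.rm = TRUE))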

Related

Time Series application - Guidance Needed

I am relatively new to R, and am currently trying to fit a time series model to a data set to predict product volume for the next six months. My data set has two columns, Date (a timestamp) and the volume of product in inventory on that particular day, for example:
Date Volume
24-06-2013 16986
25-06-2013 11438
26-06-2013 3378
27-06-2013 27392
28-06-2013 24666
01-07-2013 52368
02-07-2013 4468
03-07-2013 34744
04-07-2013 19806
05-07-2013 69230
08-07-2013 4618
09-07-2013 7140
10-07-2013 5792
11-07-2013 60130
12-07-2013 10444
15-07-2013 36198
16-07-2013 11268
I need to predict six months of product volume required in inventory after the end date (the last row in my data set is "14-06-2019", "3131076"). I have approximately six years of data, starting 24-06-2013 and ending 14-06-2019.
I tried using auto.arima in R on my data set and got many errors. I started researching ways to make my data suitable for time series analysis and came across the imputeTS and zoo packages.
I guess the date has high relevance for setting the frequency value in the model, so I created a new column and calculated the count of each weekday, which turns out not to be equal:
data1 <- mutate(data, day = weekdays(as.Date(Date)))
> View(data1)
> table(data1$day)
Friday Monday Saturday Sunday Thursday Tuesday Wednesday
213 214 208 207 206 211 212
There are no missing values against the dates that are present, but we can see from the above that the count of each weekday is not the same, so some dates are missing. How do I proceed with that?
I have hit a kind of dead end; I have tried going through various posts here on imputeTS and the zoo package but without much success.
Can someone please guide me on how to proceed further? Pardon me, admins and users, if you think this is spamming, but it is really important for me at the moment. I have tried various tutorials on time series elsewhere, but almost all of them use the AirPassengers data set, which I think has no flaws.
Regards
RD
library(imputeTS)
library(dplyr)
library(forecast)
setwd("C:/Users/sittu/Downloads")
data <- read.csv("ts.csv")
str(data)
$ Date : Factor w/ 1471 levels "01-01-2014","01-01-2015",..: 1132 1181 1221 1272 1324 22 71 115 163 213 ...
$ Volume: Factor w/ 1468 levels "0","1002551",..: 379 116 840 706 643 1095 1006 864 501 1254 ...
data$Volume <- as.numeric(data$Volume)
data$Date <- as.Date(data$Date, format = "%d/%m/%Y")
str(data)
'data.frame': 1471 obs. of 2 variables:
$ Date : Date, format: NA NA NA ... ## 1st Error now showing NA instead of dates
$ Volume: num 379 116 840 706 643 ...
Let's try to generate that dataset. First, let's reproduce a dataset with missing data:
library(dplyr)
dates <- seq(as.Date("2018-01-01"), as.Date("2018-12-31"), 1)
volume <- floor(runif(365, min=2500, max=50000))
dummy_df <- data.frame(date=dates, Volume=volume)
df <- dummy_df %>% sample_frac(0.8)
Here we generated a data frame with a date and volume for each day of 2018, then dropped 20% of the rows at random (sample_frac(0.8) keeps 80%).
This should mimic your dataset reasonably well, with data missing for some days.
What we want from there is to find the days with no volume data :
Df_full_dates <- as.data.frame(dates) %>%
left_join(df,by=c('dates'='date'))
Now you want to replace the NA values (which correspond to days with no data) with a volume. I used 0 here, but if the data are genuinely missing you might prefer the monthly average or some other specific value; I don't know from your sample what suits your data best:
Df_full_dates[is.na(Df_full_dates)] <- 0
From there, you have a dataset with a value for each day, and you should be able to find a model to predict the volume in future months.
Tell me if you have any questions.
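From that point, a hedged sketch of the modelling step the question was aiming for, using auto.arima from the forecast package on the completed daily series; the weekly frequency of 7 and the roughly six-month (183-day) horizon are assumptions rather than anything from the original post:
library(forecast)
# daily series built from the gap-filled data; frequency 7 assumes weekly seasonality
ts_vol <- ts(Df_full_dates$Volume, frequency = 7)
fit <- auto.arima(ts_vol)
# forecast roughly six months (about 183 days) ahead
fc <- forecast(fit, h = 183)
plot(fc)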

add value to column within a data frame only for a subset

I have a data frame, called APD, and I would like to assign a value to the column "Fitted_voltage", but only for a specific subset (grouped by serial_number). How do I do that?
In the following example I want to assign 150 to Fitted_Voltage, but only for Serial_number 912009913.
Serial_number Lot Wafer Amplification Voltage Fitted_Voltage
912009913 9 912 1878 375.3 NA
912009913 9 912 1892 376.8 NA
912009913 9 912 1900 377.9 NA
812009897 8 812 3931.1 370.5 NA
812009897 8 812 3934.8 371 NA
812009897 8 812 3939.9 372.3 NA
...
...
Finally I would like to do this automatically. I fit some data points and want to assign to each serial_number the fitted result.
The process could be:
Fit via function function_to_observe and do point-wise inverse regression at a specific value of 150 for serial number 912009913:
function_to_observe(150)
This yields the result
[1] 360.6395
which shall be stored in the column Fitted_Voltage of the data frame for that single serial_number.
Then the next serial_number, 812009897, will be fitted, its value stored, and so on.
I know I can add the value to the column, but not limited to the subset:
APD["Fitted_Voltage"] <- Fitted_voltage<- function_to_observe(150)
Update: Following Eric Lecoutre's answer, I now have:
ID<- 912009913
ID2<- 912009914
APD_result<- data.frame(Serial_Number=rep(c(ID, ID2),each=1), Fitted_Voltage=NA)
comp <- tapply(APD_result$Fitted_Voltage, APD_result$Serial_Number, function_to_observe = inverse((function(x=150) exp(exp(sum(x^0*VK[1],x^1*VK[2],x^2*VK[3],x^3*VK[4])))), 50, 375))
APD_result$Fitted_Voltage = comp[APD_result$Serial_Number]
This works very well, but I need to apply some changes which are not so minor for me:
1.) The serial numbers have to be added automatically (here they are given as the two examples ID and ID2).
2.) I do not get tapply to run since I removed Voltage. Sorry for not specifying this in my previous question: the voltage is not of interest, I only want Serial_number and Fitted_Voltage, which belong together, in the final frame.
It is not completely clear to me what your function_to_observe does. I assume it exploits the set of Voltage values for a given Serial_Number.
I prepared a small function that does so having an additional argument (value).
Does the following answer your question?
df <- data.frame(Serial_Number=rep(c("a","b"),each=3),Voltage=abs(100*rnorm(6)), FittedVoltage=NA)
function_to_observe <- function(vec,value=150) {mean(vec)+value}
comp <- tapply(df$Voltage, df$Serial_Number, function_to_observe, value=150)
df$FittedVoltage = comp[df$Serial_Number]
The result is:
Serial_Number Voltage FittedVoltage
1 a 21.01196 205.4419
2 a 37.04815 205.4419
3 a 108.26565 205.4419
4 b 121.37657 264.3040
5 b 39.92053 264.3040
6 b 181.61485 264.3040
(Yes, I know the fitted voltage here is totally unrelated to the voltage... I just don't understand what your 150 does here.)
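For the follow-up (all serial numbers picked up automatically, with only Serial_number and Fitted_Voltage in the final frame), here is a sketch using dplyr instead of tapply; function_to_observe below is only a stand-in, since the real fit/inverse-regression routine is not shown, and the column names are taken from the APD table above:
library(dplyr)
# stand-in for the real fitting routine: takes the voltages of one serial number and the target value
function_to_observe <- function(voltage, value = 150) mean(voltage) + value
APD_result <- APD %>%
group_by(Serial_number) %>%
summarise(Fitted_Voltage = function_to_observe(Voltage, value = 150))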

Combining and Filtering Daily Data Sets in R

I am currently trying to find the most efficient way to use a large collection of daily transaction data sets. Here is an example of one day's data set:
Date Time Name Price Quantity
2005/01/03 012200 Dave 1.40 1
2005/01/03 012300 Anne 1.35 2
2005/01/03 015500 Steve 1.54 1
2005/01/03 021500 Dave 1.44 15
2005/01/03 022100 James 1.70 7
In the real data, there are ~40,000 rows per day, and each day is a separate comma-delimited .txt file. The data go from 2005 all the way to today. I am only interested in "Dave" and "Anne," (as well as 98 other names) but there are thousands of other people in the set. Some days may have multiple entries for a given person, some days may have none for a given person. Since there is a large amount of data, what would be the most efficient way of extracting and combining all of the data for "Anne," "Dave," and the other 98 individuals (Ideally into 100 separate data sets)?
The two ways I can think of are:
1) Filtering each day to only "Dave" or "Anne" and then appending to one big data set.
2) Appending all days to one big data set and then filtering to "Dave" or "Anne."
Which method would give me the most efficient results? And is there a better method that I can't think of?
Thank you for the help!
Andy
I believe the question can be answered analytically.
Workflow
As #Frank pointed out, the method may depend on the processing requirements:
Is this a one-time exercise?
Then the feasibility of both methods can be further investigated.
Is this a repetitive task where the actual daily transaction data should be added?
Then method 2 might be less efficient if it processes the whole bunch of data anew at every repetition.
Memory requirement
R keeps all data in memory (unless one of the special "big memory" packages is used). So, one of the constraints is the available memory of the computer system used for this task.
As already pointed out in brittenb's comment there are 12 years of daily data files summing up to a total of 12 * 365 = 4380 files. Each file contains about 40 k rows.
The 5 rows sample data set provided in the question can be used to create a 40 k rows dummy file by replication:
library(data.table)
DT <- fread(
"Date Time Name Price Quantity
2005/01/03 012200 Dave 1.40 1
2005/01/03 012300 Anne 1.35 2
2005/01/03 015500 Steve 1.54 1
2005/01/03 021500 Dave 1.44 15
2005/01/03 022100 James 1.70 7 ",
colClasses = c(Time = "character")
)
DT40k <- rbindlist(replicate(8000L, DT, simplify = FALSE))
str(DT40k)
Classes ‘data.table’ and 'data.frame': 40000 obs. of 5 variables:
$ Date : chr "2005/01/03" "2005/01/03" "2005/01/03" "2005/01/03" ...
$ Time : chr "012300" "012300" "012300" "012300" ...
$ Name : chr "Anne" "Anne" "Anne" "Anne" ...
$ Price : num 1.35 1.35 1.35 1.35 1.35 1.35 1.35 1.35 1.35 1.35 ...
$ Quantity: int 2 2 2 2 2 2 2 2 2 2 ...
- attr(*, ".internal.selfref")=<externalptr>
- attr(*, "sorted")= chr "Name"
print(object.size(DT40k), units = "Mb")
1.4 Mb
For method 2, at least 5.9 Gb (4380 * 1.4 Mb) of memory is required to hold all rows (unfiltered) in one object.
If your computer system is limited in memory then method 1 might be the way to go. The OP has mentioned that he is only interested in keeping the transaction data of just 100 names out of several thousand. So after filtering, the data volume might finally be reduced to 1% to 10% of the original volume, i.e., to 60 Mb to 600 Mb.
Speed
Disk I/O is usually the performance bottleneck. With the fast I/O functions included in the data.table package we can simulate the time needed for reading all 4380 files.
# write file with 40 k rows
fwrite(DT40k, "DT40k.csv")
# measure time to read the file
microbenchmark::microbenchmark(
fread = tmp <- fread("DT40k.csv", colClasses = c(Time = "character"))
)
Unit: milliseconds
expr min lq mean median uq max neval
fread 34.73596 35.43184 36.90111 36.05523 37.14814 52.167 100
So, reading all 4380 files should take less than 3 minutes.
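To make method 1 concrete, here is a minimal sketch under the assumption that the daily files sit in one folder and that the 100 names of interest are collected in a character vector; the folder name and the names shown are placeholders:
library(data.table)
keep_names <- c("Dave", "Anne") # extend to all 100 names of interest
files <- list.files("daily_data", pattern = "\\.txt$", full.names = TRUE)
read_filtered <- function(f) {
DT <- fread(f, colClasses = c(Time = "character"))
DT[Name %chin% keep_names] # filter each day before it is kept in memory
}
all_trades <- rbindlist(lapply(files, read_filtered))
# optionally, one data set per person
by_person <- split(all_trades, by = "Name")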
IMO, and if storage space is not an issue, you should go with option 2. This gives you a lot more flexibility in the long run (say you want to add / remove names in the future).
It is always easier to trim the data than to regret not collecting it. The only reason I would go with option 1 is if storage or speed is a bottleneck in your workflow.

R subset function returns zero records, for unexplained reason

I must be missing something very basic. Hope someone can point it out. I'm trying to subset the following data frame based on a specific year and sex...
str(Bnames)
'data.frame': 258000 obs. of 4 variables:
$ X.year. : int 1880 1880 1880 1880 1880 1880 1880 1880 1880 1880 ...
$ X.name. : Factor w/ 6782 levels "\"Aaden\"","\"Aaliyah\"",..: 3380 6632 3125 1174 2554 2449 3428 6232 2834 5517 ...
$ X.percent.: num 0.0815 0.0805 0.0501 0.0452 0.0433 ...
$ X.sex. : Factor w/ 2 levels "\"boy\"","\"girl\"": 1 1 1 1 1 1 1 1 1 1 ...
The code I have entered is
one <- subset(Bnames, X.year.==2008 & X.sex.=="boy") # I get zero rows returned
two<- subset(Bnames, X.year.==2008) # I get 2000 rows returned, which is correct
three <- subset(Bnames, X.sex.=="boy") # I get 0 rows returned
four <- subset(Bnames, X.name.=="John") # I get 0 rows returned
I don't understand. I'm using a data set that is freely available at http://plyr.had.co.nz/09-user/
If I make my own data frame by repeated sampling of c("boy","girl"), the subset works fine. Why is the code failing with the data that I started with?
The reason you are getting 0 results is that the levels of your factor columns include the quotes. For instance, the X.sex. column levels are not boy or girl, but rather "boy" and "girl". This may be due to the fact that the file you imported your data.frame from had quoted fields and was read through read.table (or some equivalent function) with quoting disabled (e.g. quote = ""). If that's the case, you could easily re-read the file and correct this rather annoying feature.
Anyway, to properly subset your data.frame, remember the quotes. For instance:
one <- subset(Bnames, X.year.==2008 & X.sex.=="\"boy\"")
Alternatively, you may use single quotes around the pattern:
one <- subset(Bnames, X.year.==2008 & X.sex.=='"boy"')
If you want to get rid of the annoying quotes without having to rebuild your data.frame, just try:
Bnames[,4] <- factor(gsub('"', "", Bnames[,4]))
# the same applies to the quoted X.name. column (Bnames[,2])
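If you would rather re-read the file instead, something like this should drop the embedded quotes at import time (the file name here is only a placeholder for whatever you downloaded):
Bnames <- read.csv("baby-names.csv", quote = "\"", stringsAsFactors = TRUE)
str(Bnames) # the sex levels should now be boy/girl without embedded quotes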

Regarding creating a data set in accordance with a given data format

I am learning to use topicmodels package and R as well, and explored one of its example data set by using
str(testdata)
'data.frame': 3104 obs. of 5 variables:
$ Article_ID: int 41246 41257 41268 41279 41290 41302 41314 41333 41344 41355 ...
$ Date : chr "1-Jan-96" "2-Jan-96" "3-Jan-96" "4-Jan-96" ...
$ Title : chr "Nation's Smaller Jails Struggle To Cope With Surge in Inmates" "FEDERAL IMPASSE SADDLING STATES WITH INDECISION" "Long, Costly Prelude Does Little To Alter Plot of Presidential Race" "Top Leader of the Bosnian Serbs Now Under Attack From Within" ...
$ Subject : chr "Jails overwhelmed with hardened criminals" "Federal budget impasse affect on states" "Contenders for 1996 Presedential elections" "Bosnian Serb leader criticized from within" ...
$ Topic.Code: int 12 20 20 19 1 19 1 1 20 15 ...
If I want to create a data set according to the above format in R, how do I do that?
testdata is a data.frame, one of the few fundamental R objects. You should probably start here: http://cran.r-project.org/doc/manuals/R-intro.pdf.
Some functions for creating data.frames are data.frame, read.table, read.csv. For each of these you can access their documentation by typing ?data.frame for example. Good luck.
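For example, a small data frame with the same columns and types as testdata can be built directly with data.frame(); the values below are made up purely for illustration:
testdata <- data.frame(
Article_ID = c(41246L, 41257L, 41268L),
Date = c("1-Jan-96", "2-Jan-96", "3-Jan-96"),
Title = c("Title A", "Title B", "Title C"),
Subject = c("Subject A", "Subject B", "Subject C"),
Topic.Code = c(12L, 20L, 20L),
stringsAsFactors = FALSE # keep Date/Title/Subject as character, matching str(testdata)
)
str(testdata)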
