Add a vector as a single observation to a data.frame

I'm trying to save a number of spectral measurements in a data.frame. Each measurement has a number of attributes as well as two channels of spectral data, each with 2048 data points. I would like to have each channel be a single point of data in the data frame.
Something like this:
  timestamp           type integration channel1 channel2
1 2011-10-02 02:00:01 D    2000        (spec)   (spec)
2 2011-10-02 02:00:07 D    2000        (spec)   (spec)
Where each (spec) is a vector of 2048 values. I simply cannot get it to work, and I now turn to you guys for help.
Thanks in advance.

You can add a matrix as one of the data.frame's fields, so you can store each 2048-point vector as a row of that matrix.
# example attributes: one row per measurement
DF <- data.frame(timestamp = 1:3, type = LETTERS[1:3], integration = rep(2000, 3))
# each channel is a 3 x 2048 matrix: one row per measurement, one column per data point
DF$channel1 <- matrix(rnorm(3 * 2048), nrow = 3)
DF$channel2 <- matrix(rnorm(3 * 2048), nrow = 3)
ncol(DF)  # == 5: each matrix counts as a single column
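To get one measurement's spectrum back out, index the matrix column by row; a quick sketch on the toy data above:
spec <- DF$channel1[1, ]  # channel-1 spectrum for the first measurement
length(spec)              # 2048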

I think what you want is doable, but I may not be fully understanding your question. Heed Joris's suggestion, though, as that may be a better way of storing your data. You can accomplish what you want by storing the vectors of 2048 values in a list that you then add to the data frame as a column. Your table wasn't easily imported (for me, anyway) with read.table, so I made up my own data frame for the example.
DF <- data.frame(timestamp = 1:3, type = LETTERS[1:3], integration = rep(2000, 3))
DF$channel1 <- list(rnorm(2048), rnorm(2048), rnorm(2048))
DF$channel2 <- list(rnorm(2048), rnorm(2048), rnorm(2048))
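To pull a single spectrum back out of a list column, index with double brackets; a quick sketch on the toy data above:
DF$channel1[[2]]          # the full 2048-point vector for row 2
length(DF$channel1[[2]])  # 2048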

Related

How does one hard code data into a data frame?

Below is the sample data and the manipulation. One will notice that in month 1 for each indcode there is an NA for empprevmonth, and therefore for empprevmonthchg. How would one hard-code data into these columns? Yes, I know that there is a limit to the data, hence the NA, but what if I did want to manually input numbers after the fact? Can this be done?
periodyear3 <- c(2020,2020,2020,2020,2020,2020,2020,2020,2020,2020,2020,2020)
month3 <- c(1,2,3,4,5,6,1,2,3,4,5,6)
indcode3 <- c(624410,624410,624410,624410,624410,624410,72,72,72,72,72,72)
employment3 <- c(25,25,26,27,28,29,85,86,87,88,89,90)
wages3 <- c(10000,10001,10002,10003,10004,10005,12510,12515,12520,12520,16528,19874)
example <- data.frame(periodyear3, month3, indcode3, employment3, wages3)

library(dplyr)
example <- example %>%
  group_by(indcode3) %>%
  mutate(empprevmonth = lag(employment3, 1),
         empprevmonthchg = employment3 - empprevmonth)
In the larger data frame (not shown here), the complication is that we have monthly data from 2012-12-01 to 2021-07-01. In that data set there is an NA for empprevmonth at 2012-12-01, which makes sense. But because there is an NA in the first row, there is an NA in the second (2013-01-01) as well, and it is that second row where I need to force data into the empprevmonth and empprevmonthchg columns.
We could change the default value in lag (i.e., NA) to a different one, so as to differentiate it:
library(dplyr)
example <- example %>%
  group_by(indcode3) %>%
  mutate(empprevmonth = lag(employment3, 1, default = -999),
         empprevmonthchg = employment3 - empprevmonth)
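Once the sentinel is in place you can overwrite it by hand; a hedged sketch, where 24 is a made-up stand-in for whatever value you want to force into the first month:
example <- example %>%
  mutate(empprevmonth = ifelse(empprevmonth == -999, 24, empprevmonth),  # 24 is hypothetical
         empprevmonthchg = employment3 - empprevmonth)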

Aggregating functions which operate on non-data frame objects in R

I have a simple question. The aggregate() function in R operates on a dataframe based on the conditions specified.
The default usage is aggregate(my.data.frame, by = list(grouping column), FUN = function to be applied).
It is useful for computing simple functions like the mean or median of a data frame's columns by group. What I have, though, is a function which doesn't operate on data frames, and I need to aggregate my data frame after applying this function to a specific column. Let me show the dataset:
(screenshot of the GPS dataset: a BSSID column plus latitude and longitude columns)
So I need to compute the centroid of the longitude and latitude points for EACH BSSID; I need to aggregate it that way. The functions I found online in various packages compute the centroid for a matrix of values, not a data frame, whereas aggregate() doesn't work on non-data frames.
Many thanks in advance :)
Aggregate works fine on matrices (and not just data frames).
Here's a reproducible example of your problem, using a matrix instead of a data frame:
my_matrix <- matrix(c(100,100,200,200, 11,22,33,44, -1,-2,-3,-4),
                    nrow = 4, ncol = 3,
                    dimnames = list(c(1,2,3,4), c('BSSID','lat','long')))
> my_matrix
  BSSID lat long
1   100  11   -1
2   100  22   -2
3   200  33   -3
4   200  44   -4
> aggregate(cbind(lat,long) ~ BSSID, my_matrix, mean)
  BSSID  lat long
1   100 16.5 -1.5
2   200 38.5 -3.5
So that would be the mean (or the centroid) of the latitudes and longitudes for each BSSID. The cbind function (column-bind) lets you select multiple variables to be aggregated, similar to an Excel Pivot Table.
If still in doubt, you can always convert matrices to data frames using the as.data.frame() function and revert back to matrices using as.matrix() if needed.
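For instance, a short sketch of that round trip on the toy matrix above:
my_df <- as.data.frame(my_matrix)                 # matrix -> data frame
aggregate(cbind(lat, long) ~ BSSID, my_df, mean)  # same result as before
back <- as.matrix(my_df)                          # and back to a matrix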
I like dplyr for this; the syntax looks nice to me.
library(dplyr)
my.data.frame %>%
  group_by(bssid) %>%
  summarise(centroidlon = myfunction(lon, lat)[1],
            centroidlat = myfunction(lon, lat)[2])
If myfunction is fast, then this will work, but if it is slow, you probably want to rework it so that you only call the function once per bssid.
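One hedged way to call it only once per group is to stash the result in a list column and unpack it afterwards; a sketch, assuming myfunction returns a length-2 vector c(lon, lat):
my.data.frame %>%
  group_by(bssid) %>%
  summarise(centroid = list(myfunction(lon, lat))) %>%  # one call per bssid
  mutate(centroidlon = sapply(centroid, `[`, 1),
         centroidlat = sapply(centroid, `[`, 2)) %>%
  select(-centroid)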
Edit, to show an alternative method without the %>% operator:
grouped.data.frame <- group_by(my.data.frame, bssid)
summarised.data.frame <- summarise(grouped.data.frame,
                                   centroidlon = myfunction(lon, lat)[1],
                                   centroidlat = myfunction(lon, lat)[2])
The %>% operator takes the left-hand side and passes it as the first argument to the right-hand side. It's useful for chaining statements together without getting lost in hundreds of nested brackets, and it makes things easier to read, in my opinion.
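Concretely, these two lines are equivalent (a toy illustration; n() simply counts rows per group):
summarise(group_by(my.data.frame, bssid), n = n())
my.data.frame %>% group_by(bssid) %>% summarise(n = n())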

Select multiple observations in a matrix based on a specific condition

I am very new to the R interface but need to use the program in order to run the relevant analyses for my clinical doctorate thesis. So, apologies in advance if this is a novice question.
I have a matrix of beta methylation values with dimensions 485577 x 894. The row names of the matrix refer to CpG sites and are neither numeric nor in ascending order (e.g. "cg00000029", "cg00000108", "cg00000109", "cg00000165"), while the column names refer to participant IDs, which are also not in ascending order (e.g. "11209", "14140", "1260", "5414").
I would like to identify which beta methylation values are > 0.5 so that I can exclude them from further analyses, and in doing so I need the data to stay in matrix format. All my attempts so far have returned integer indices rather than the data as a matrix.
I would be so grateful if someone could please advise me of the code to conduct this analysis.
Thank you for your time.
Cheers,
Alicia
set.seed(1)  # so the example is reproducible
m <- matrix(runif(1000, 0, 0.6), nrow = 100)  # 100 rows x 10 cols, values in U[0, 0.6]
m[m > 0.5] <- NA   # anything > 0.5 set to NA
z <- na.omit(m)    # drop all rows containing any NAs; z is still a matrix
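If you first want to see where the offending values sit, run something like this before the replacement step above (a hedged alternative; arr.ind = TRUE makes which() return row/column positions instead of flat indices):
idx <- which(m > 0.5, arr.ind = TRUE)  # (row, col) positions of values > 0.5
head(idx)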

Exclude data based on the number of non NA observations for each value of key

I have a dataset consisting of monthly observations of returns for US companies. I am trying to exclude from my sample all companies which have fewer than a certain number of non-NA observations.
I managed to do what I want using foreach, but my dataset is very large and this takes a long time. Here is a working example which shows how I accomplished what I wanted and hopefully makes my goal clear:
# load required packages
library(data.table)
library(foreach)

# example data
myseries <- data.table(
  X = sample(letters[1:6], 30, replace = TRUE),
  Y = sample(c(NA, 1, 2, 3), 30, replace = TRUE))
setkey(myseries, "X")  # so X is the company identifier

# here I create another data.table with each company identifier
# and its number of non-NA observations
nobsmyseries <- myseries[, list(NOBSnona = length(Y[complete.cases(Y)])), by = X]

# then I select the companies which have fewer than 3 non-NA observations
comps <- nobsmyseries[NOBSnona < 3, ]

# finally I exclude all companies which are in the list "comps",
# that is, companies with fewer than 3 non-NA observations,
# but I do it for each company in the list, one by one,
# and this is what makes it slow
for (i in 1:dim(comps)[1]) {
  myseries <- myseries[X != comps$X[i], ]
}
How can I do this more efficiently? Is there a data.table way of getting the same result?
If you have more than one column you wish to consider for NA values then you can use complete.cases(.SD); however, as you want to test a single column, I would suggest something like
naCases <- myseries[, list(nonNA = sum(!is.na(Y))), by = X]  # non-NA count per company
You can then join, given a threshold number of non-NA values, e.g.
threshold <- 3
myseries[naCases[nonNA >= threshold]]  # keep companies with at least `threshold` non-NA values
You could also use a not-join to get the cases you have excluded:
myseries[!naCases[nonNA >= threshold]]
As noted in the comments, something like
myseries[, nonNA := sum(!is.na(Y)), by = X][nonNA >= 3]
would work; however, in this case you are performing the vector scan on the entire data.table, whereas the previous solution performs it on a data.table with only length(unique(myseries[['X']])) rows.
Given that this is a single vector scan it will be efficient regardless (and a binary join plus a small vector scan may even be slower than one larger vector scan), but I doubt there will be much difference either way.
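A hedged variant using an explicit on= join, which assumes a reasonably recent data.table (on= joins need no setkey); note the join carries the nonNA count along as an extra column:
keep <- myseries[, .(nonNA = sum(!is.na(Y))), by = X][nonNA >= 3]
myseries[keep, on = "X"]  # rows of myseries for companies with at least 3 non-NA values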
How about aggregating the number of non-NA values in Y over X, and then subsetting?
# count non-NA observations per company
num_nas <- as.data.table(aggregate(formula = Y ~ X, data = myseries, FUN = function(x) sum(!is.na(x))))
# keep only companies with at least 3 non-NA observations
myseries[X %in% num_nas$X[num_nas$Y >= 3], ]

Merging databases in R on multiple conditions with missing values (NAs) spread throughout

I am trying to build a database in R from multiple csvs. There are NAs spread throughout each csv, and I want to build a master list that summarizes all of the csvs in a single database. Here is some quick code that illustrates my problem (most csvs actually have 1000s of entries, and I would like to automate this process):
d1 <- data.frame(common = letters[1:5], species = paste(LETTERS[1:5], letters[1:5], sep = '.'))
d1$species[1] <- NA
d1$common[2] <- NA
d2 <- data.frame(common = letters[1:5], id = 1:5)
d2$id[3] <- NA
d3 <- data.frame(species = paste(LETTERS[1:5], letters[1:5], sep = '.'), id = 1:5)
I have been going around in circles (writing loops), trying to use merge and reshape (melt/cast) without much luck, in an effort to succinctly summarize the information available. This seems very basic, but I can't figure out a good way to do it. Thanks in advance.
To be clear, I am aiming for a final database like this:
common species id
1 a A.a 1
2 b B.b 2
3 c C.c 3
4 d D.d 4
5 e E.e 5
I recently had a similar situation. The code below goes through all the variables and pulls as much information as possible back into the dataset. Once all the data is there, running it one last time on the first variable gives you the result.
# combine all into one data frame
require(gtools)
d <- smartbind(d1, d2, d3)

# function to get the first non-NA result
getfirstnonna <- function(x) {
  ret <- head(x[which(!is.na(x))], 1)
  if (length(ret) == 0) NA else ret  # head() of an empty match is zero-length, not NULL
}

# function to get the maximum information based on one variable
runiteration <- function(dataset, variable) {
  require(plyr)
  e <- ddply(.data = dataset, .variables = variable,
             .fun = function(x) apply(X = x, MARGIN = 2, FUN = getfirstnonna))
  # return the above without the NA "factor"
  e[which(!is.na(e[, variable])), ]
}

# run through all variables
for (i in seq_along(names(d))) {
  d <- rbind(d, runiteration(d, names(d)[i]))
}

# repeat the first variable, since all possible info should now be in the dataset
d <- runiteration(d, names(d)[1])
If id, species, etc. differ between the separate datasets, this will return whichever non-NA value is on top. In that case, changing the row order in d, or the variable order, could affect the result. Changing the getfirstnonna function would alter this behavior (tail would pick the last value; you could even collect all possibilities). You could also order the dataset from the most complete records to the least, as sketched below.
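A hedged sketch of that last suggestion, reordering the combined data frame so the most complete records come first before running the iterations:
d <- d[order(rowSums(is.na(d))), ]  # rows with the fewest NAs first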
