How does one hard code data into a data frame? - r

Below is the sample data and the manipulation. Notice that in month 1 for each indcode there is an NA for empprevmonth and therefore for empprevmonthchg. How would one hard code data into these columns? Yes, I know that the data is limited, hence the NA, but what if I wanted to manually input numbers after the fact? Can this be done?
library(dplyr)
periodyear3 <- c(2020,2020,2020,2020,2020,2020,2020,2020,2020,2020,2020,2020)
month3 <- c(1,2,3,4,5,6,1,2,3,4,5,6)
indcode3 <- c(624410,624410,624410,624410,624410,624410,72,72,72,72,72,72)
employment3 <- c(25,25,26,27,28,29,85,86,87,88,89,90)
wages3 <- c(10000,10001,10002,10003,10004,10005,12510,12515,12520,12520,16528,19874)
example <- data.frame(periodyear3, month3, indcode3, employment3, wages3)
# previous month's employment within each industry code, and the change from it
example <- example %>%
  group_by(indcode3) %>%
  mutate(empprevmonth = lag(employment3, 1),
         empprevmonthchg = employment3 - empprevmonth)
In the real, larger data frame, the complication is that we have monthly data from 2012-12-01 to 2021-07-01. There, empprevmonth is NA for 2012-12-01, which makes sense. But because there is an NA in the first row, there is also an NA in the second (2013-01-01). It is the second row where I need to force data into the empprevmonth and empprevmonthchg columns.

We could change the default value in lag (NA) to a different one, so that the rows without a previous month can be told apart:
library(dplyr)
example <- example%>%
group_by(indcode3)%>%
mutate(empprevmonth = lag(employment3,1, default = -999),
empprevmonthchg=(employment3-empprevmonth))
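To actually write known numbers into those cells after the fact, plain logical indexing may be enough. The following is only a sketch: it rebuilds a cut-down version of the example in base R, and the values 24 and 84 are made up for illustration, standing in for whatever prior-month figures you have.

```r
# Cut-down version of the example data (wages and year omitted)
example <- data.frame(
  month3      = rep(1:6, times = 2),
  indcode3    = rep(c(624410, 72), each = 6),
  employment3 = c(25, 25, 26, 27, 28, 29, 85, 86, 87, 88, 89, 90)
)

# Previous month's employment within each indcode3 (NA in the first month)
example$empprevmonth <- ave(example$employment3, example$indcode3,
                            FUN = function(x) c(NA, head(x, -1)))

# Hard code the missing cells; 24 and 84 are hypothetical values
example$empprevmonth[example$month3 == 1 & example$indcode3 == 624410] <- 24
example$empprevmonth[example$month3 == 1 & example$indcode3 == 72]     <- 84

# Recompute the change so it picks up the filled-in values
example$empprevmonthchg <- example$employment3 - example$empprevmonth
```

The same row condition (for instance on the date column in the larger data set) can be used to overwrite empprevmonthchg directly, if you would rather not recompute it.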

Related

Omitting NAs from Data

First time posting. Apologies if I'm not as clear as I intend.
I have an Excel (xlsx) spreadsheet of data; it's sequencing data, if that helps. It is generally indexed as follows:
column 1 = organism families (hundreds of organisms down this column)
columns 2-x = specific samples
Many of the boxes scattered throughout the data are zero values, or too low, which I want to omit. I set my data such that anything under 5 is set to an NA. Since different samples will have many more, less, or different species omitted by that threshold, I want to separate by samples. Code so far is:
# Files work, I just omitted my directories to place online
library(readxl)
my_counts <- read_excel("...Family_120821.xlsx" , sheet = "family_Counts")
my_perc <- read_excel("...Family_120821.xlsx" , sheet = "family_Percentages")
my_counts[my_counts < 5] <- NA
my_counts
my_perc[my_perc < 0.05] <- NA
my_perc
S13 <- my_counts$Sample.13
S13A <- na.omit(S13)
S13A
S14 <- my_counts$Sample.14
S14A <- na.omit(S14)
S14A
S15 <- my_counts$Sample.15
S15A <- na.omit(S15)
S15A
...
First question: is there a better way I can go about this, such that I can replicate it on different data without typing out each individual sample?
Most important question: when I do this, I get the values I want, with no NAs. But they are bare values, when I want another data frame so I can write it back to an xlsx. As I have it, I lose the association to the organism.
Ex: Before
All samples by associated organisms
Ex: After
Single sample, no NAs, but also no association to organism index
Essentially the following image, but broken into individual samples. With only the organisms that met my threshold of 5 for counts, 0.05 for percents.
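One way this could be approached is to loop over the sample columns and keep the family column alongside each one. This is a minimal sketch, assuming the first column is called family and every other column is a sample; the toy data frame and the name per_sample are made up for illustration.

```r
# Toy stand-in for the counts sheet; the real data would come from read_excel()
my_counts <- data.frame(
  family    = c("A", "B", "C", "D"),
  Sample.13 = c(12, 0, 7, 3),
  Sample.14 = c(1, 9, 0, 20),
  stringsAsFactors = FALSE
)

# Every column except the organism index is treated as a sample
sample_cols <- setdiff(names(my_counts), "family")

# One data frame per sample: organism plus that sample's count,
# keeping only rows at or above the threshold of 5
per_sample <- lapply(sample_cols, function(col) {
  keep <- !is.na(my_counts[[col]]) & my_counts[[col]] >= 5
  my_counts[keep, c("family", col)]
})
names(per_sample) <- sample_cols

per_sample$Sample.13
```

Because per_sample is a named list of data frames, it should be possible to hand it to a writer such as writexl::write_xlsx(), which writes one sheet per list element, though I have not tested that against your file.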

Counting NA values by ID?

I'm learning R from scratch right now and am trying to count the number of NA's within a given table, aggregated by the ID of the file it came from. I then want to output that information in a new data frame, showing just the ID and the sum of the NA lines contained within. I've looked at some similar questions, but they all seem to deal with very short datasets, whereas mine is comparably long (10k + lines) so I can't call out each individual line to aggregate.
Ideally, if I start with a data table called "Data" with a total of four columns, and one column called "ID", I would like to output a data frame that is simply:
[ID] [NA_Count]
1 500
2 352
3 100
Thanks in advance...
Something like the following should work, although I am assuming that Date is always there and Field 1 and Field 2 are numeric:
# get file names and initialize a vector for the counts
fileNames <- list.files(<filePath>)
missRowsVec <- integer(length(fileNames))
# loop through files, get number of rows with missing values in each
for (filePos in seq_along(fileNames)) {
  # read in files **fill in <filePath>**
  temp <- read.csv(paste0(<filePath>, fileNames[filePos]), as.is = TRUE)
  # count the number of rows with missing values (MARGIN = 1 applies over rows),
  # ** fill in <fieldName#> with strings of variable names **
  missRowsVec[filePos] <- sum(apply(temp[, c(<field1Name>, <field2Name>)], 1,
                                    function(i) anyNA(i)))
} # end loop
# build data frame
myDataFrame <- data.frame("fileNames" = fileNames, "missCount" = missRowsVec)
This may be a bit dense, but it should work more or less. Try small portions of it, like just some inner function, to see how stuff works.
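If everything is already in a single table with an ID column, as the question describes, the loop over files may not be needed. Here is a rough sketch in base R; the toy table and the Field1 to Field3 column names are assumptions, and it counts rows that contain at least one NA.

```r
# Toy table: an ID column plus three value columns (names are hypothetical)
Data <- data.frame(
  ID     = c(1, 1, 2, 2, 3),
  Field1 = c(NA, 2, 3, NA, 5),
  Field2 = c(1, NA, 3, NA, 5),
  Field3 = c(1, 2, 3, 4, 5)
)

# TRUE for every row that has at least one NA outside the ID column
value_cols <- setdiff(names(Data), "ID")
row_has_na <- !complete.cases(Data[value_cols])

# Sum the flagged rows within each ID
result <- aggregate(list(NA_Count = row_has_na),
                    by = list(ID = Data$ID), FUN = sum)
result
```

To count individual NA cells rather than rows, swapping row_has_na for rowSums(is.na(Data[value_cols])) should do it.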

omitting certain data in R to maintain overall data integrity

I have a function that returns 50 data values, in a one-column matrix, for each of 100 different data frames. However, due to circumstance, the function sometimes returns a "NaN" in one or more of the 50 values in a data frame. This perturbs the data, as a data frame that has one or more NaN is now considered to have 49 or 48 values.
df1 df2
112.4563 112.4563
110.1210 110.1210
109.2143 109.2143
NaN 108.1806 <- now uneven and can not perform iterations
107.3700 107.3700
How can I tell my computer/subsequent commands, when iterating through these 100 50-row data frames, to "ignore" the NaN values in a way that each of the 100 will still have 50 values and be consistently iterable? Or is it even possible to have a varying iteration range, e.g. for(i in 1:(47-50)), so that the computer forgives the variance in row numbers?
This is also with respect to graphs.
As someone else has noted, it can also depend on what you want to do with the NaN value. However, to answer the question about an iterative range, you can do something like the following. I'll be using the data frame mtcars as an example.
df = mtcars
length(df$mpg)
length(rownames(df))
length(colnames(df))
If you need to iterate over the total number of rows in your data frame, you can use length(rownames(df)). If you need to iterate over the number of columns instead, you can use length(colnames(df)).
In a for loop, you would do the following:
for (i in 1:length(rownames(df))) {
  # iterative code
}
This will iterate over the total number of rows in a given data frame.
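As for ignoring the NaN itself, one option is to leave it in place, so every data frame keeps the same number of rows, and skip it at the point of use. A small sketch with made-up values taken from the question's table:

```r
# Five values standing in for one of the 50-row columns
vals <- c(112.4563, 110.1210, 109.2143, NaN, 107.3700)

processed <- 0
for (i in seq_along(vals)) {
  if (is.nan(vals[i])) next   # skip the NaN but keep its slot
  processed <- processed + 1  # placeholder for the real per-value work
}

length(vals)   # the vector still has all of its positions
processed      # number of values actually used

# Many summary functions can drop NaN themselves via na.rm
mean(vals, na.rm = TRUE)
```

For graphs, base plotting functions such as plot() generally leave a gap at NaN points rather than failing, so the same keep-the-slot approach tends to work there too.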

Exclude data based on the number of non NA observations for each value of key

I have a dataset consisting of monthly observations for returns of US companies. I am trying to exclude from my sample all companies which have less than a certain number of non NA observations.
I managed to do what I want using foreach, but my dataset is very large and this takes a long time. Here is a working example which shows how I accomplished what I wanted and hopefully makes my goal clear
#load required packages
library(data.table)
library(foreach)
#example data
myseries <- data.table(
X = sample(letters[1:6],30,replace=TRUE),
Y = sample(c(NA,1,2,3),30,replace=TRUE))
setkey(myseries,"X") #so X is the company identifier
#here I create another data table with each company identifier and its number
#of non NA observations
nobsmyseries <- myseries[,list(NOBSnona = length(Y[complete.cases(Y)])),by=X]
# then I select the companies which have less than 3 non NA observations
comps <- nobsmyseries[NOBSnona <3,]
#finally I exclude all companies which are in the list "comps",
#that is, I exclude companies which have less than 3 non NA observations
#but I do for each of the companies in the list, one by one,
#and this is what makes it slow.
for (i in 1:dim(comps)[1]){
myseries <- myseries[X != comps$X[i],]
}
How can I do this more efficiently? Is there a data.table way of getting the same result?
If you have more than one column you wish to check for NA values, then you can use complete.cases(.SD). However, as you want to test a single column, I would suggest something like
naCases <- myseries[,list(totalNA = sum(!is.na(Y))),by=X]
You can then join, given a threshold on the number of non-NA values (which is what totalNA holds here), e.g.
threshold <- 3
myseries[naCases[totalNA > threshold]]
You could also use a not-join to select the cases you have excluded:
myseries[!naCases[totalNA > threshold]]
As noted in the comments, something like
myseries[,totalNA := sum(!is.na(Y)),by=X][totalNA > 3]
would work; however, in this case you are performing a vector scan on the entire data.table, whereas the previous solution performed the vector scan on a data.table with only length(unique(myseries[['X']])) rows.
Given that this is a single vector scan, it will be efficient regardless (and perhaps binary join + small vector scan may be slower than a larger vector scan). However, I doubt there will be much difference either way.
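How about aggregating the number of non-NA values in Y over X, and then subsetting?
# Aggregate number of non-NA values per company
# (the formula method drops NA rows first, so all-NA companies do not appear at all)
num_nas <- as.data.table(aggregate(Y ~ X, data = myseries, FUN = function(x) sum(!is.na(x))))
# Subset: keep only companies with at least 3 non-NA observations
myseries[X %in% num_nas$X[num_nas$Y >= 3], ]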
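Another data.table idiom that might fit here is to return .SD only for the groups that pass the test, so the filtering happens in a single grouped call. A minimal sketch on a small deterministic table; min_obs and kept are names I made up:

```r
library(data.table)

# Small deterministic example: "a" has 3 non-NA values, "b" has 2, "c" has 0
myseries <- data.table(
  X = rep(c("a", "b", "c"), each = 4),
  Y = c(1, 2, 3, NA,  1, NA, NA, 2,  NA, NA, NA, NA)
)

min_obs <- 3

# For each company, return its rows only if it has enough non-NA observations;
# groups for which j evaluates to NULL are dropped from the result
kept <- myseries[, if (sum(!is.na(Y)) >= min_obs) .SD, by = X]
kept
```

I have not benchmarked this against the join approach, so treat any speed difference as something to measure on the real data.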

Add a vector as a single observation to a data.frame

I'm trying to save a number of spectral measurements in a data.frame. Each measurement has a number of attributes as well as two channels of spectral data, each with 2048 data points. I would like to have each channel be a single point of data in the data frame.
Something like this:
timestamp type integration channel1 channel2
1 2011-10-02 02:00:01 D 2000 (spec) (spec)
2 2011-10-02 02:00:07 D 2000 (spec) (spec)
Where each (spec) is a vector of 2048 values. I simply cannot get it to work, and I now turn to you guys for help.
Thanks in advance.
You can add a matrix as one of the data.frame columns, so you have to put all the vectors in as matrix rows.
DF <- data.frame(timestamp=1:3, type=LETTERS[1:3], integration=rep(2000, 3))
DF$channel1 <- matrix(rnorm(3*2048), nrow=3)
DF$channel2 <- matrix(rnorm(3*2048), nrow=3)
ncol(DF)# == 5
I think what you want is doable but I may not be fully understanding your question. Heed Joris's suggestion though as this may be a better way of storing your data. You can accomplish what you want by storing the vectors of 2048 values in a list that you then add to the data frame as a column. Your table wasn't easily imported (for me anyway) with read.table so I made up my own data frame and example.
DF <- data.frame(timestamp=1:3, type=LETTERS[1:3], integration=rep(2000, 3))
DF$channel1 <- list(c(rnorm(2048)), c(rnorm(2048)), c(rnorm(2048)))
DF$channel2 <- list(c(rnorm(2048)), c(rnorm(2048)), c(rnorm(2048)))
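Getting a single spectrum back out of a list column then works with double brackets. A short sketch along the same lines; wrapping the list in I() is my own addition, meant to make it explicit that the list should be stored as-is as one column:

```r
DF <- data.frame(timestamp = 1:3, type = LETTERS[1:3], integration = rep(2000, 3))

# I() keeps the list as a single column with one 2048-value vector per row
DF$channel1 <- I(list(rnorm(2048), rnorm(2048), rnorm(2048)))

# The spectrum for the second measurement is an ordinary numeric vector
spec2 <- DF$channel1[[2]]
length(spec2)

# Per-row summaries can be computed with sapply over the list column
DF$channel1_mean <- sapply(DF$channel1, mean)
```

With the matrix version from the other answer, the equivalent access would be DF$channel1[2, ].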

Resources