Delete row by subset - R

I have a data.frame with 2 columns: dates (the dates of observation for each station) and values (the observation data):
> head(dataset.a)
dates values
1 1976-01-01 7.5
2 1976-01-02 NA
3 1976-01-03 NA
4 1976-01-04 NA
5 1976-01-05 NA
6 1976-01-06 10.2
(...)
I have to multiply each row by a corresponding value that I already have in another data.frame:
> head(dataset.b)
dates values
1 1976-01-01 0.23
2 1976-01-02 NA
3 1976-01-03 NA
4 1976-01-04 NA
5 1976-01-05 NA
6 1976-01-06 1.23
(...)
Both datasets use the Gregorian calendar; however, dataset.a includes leap years (a 29th day in February) while dataset.b always has 28 days in February. I want to ignore every 29th of February in dataset.a and then do the multiplication.
I should be able to make a basic subset using both indices:
which(strftime(dataset.a[,1],"%d")!="29")
which(strftime(dataset.a[,1],"%m")!="02")
However, once I combine them with a logical AND, I lose the positions in the data.frame where I have YEAR-02-29; it just returns the row numbers that are TRUE for the combination of both indices.
I guess this is a very basic question, but I am lost.

Try a logical index. Note that the condition to drop is "day is 29 AND month is 02", negated as a whole; testing day != "29" & month != "02" instead would drop every 29th of every month plus all of February:
idx <- !(strftime(dataset.a[,1],"%d")=="29" & strftime(dataset.a[,1],"%m")=="02")
Then you'll get a vector of TRUE TRUE ... TRUE FALSE TRUE TRUE ... with the FALSE values coinciding with 29/Feb.
Then you can just do dataset.a[idx,] to get the non-29/Feb dates.
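Putting it together, a minimal sketch of the full workflow; it assumes both data frames are sorted by date and that dataset.b simply lacks the Feb-29 rows, so the filtered dataset.a lines up row for row with dataset.b:
idx <- !(strftime(dataset.a$dates, "%d") == "29" &
         strftime(dataset.a$dates, "%m") == "02")
a.noleap <- dataset.a[idx, ]                  # dataset.a without any 29/Feb rows
stopifnot(nrow(a.noleap) == nrow(dataset.b))  # sanity check on the alignment
result <- data.frame(dates  = a.noleap$dates,
                     values = a.noleap$values * dataset.b$values)
head(result)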

Related

Reformat wrapped data coerced into a dataframe? (R)

I have some data I need to extract from a .txt file in a very weird, wrapped format. It looks like this:
eid nint hisv large NA
1 1.00 1.00000e+00 0 1.0 NA
2 -152552.00 -6.90613e+04 -884198 -48775.7 1151.70
3 -5190.13 4.17751e-05 NA NA NA
4 2.00 1.00000e+00 0 1.0 NA
5 -172188.00 -8.16684e+04 -809131 -56956.1 -1364.07
6 -5480.54 4.01573e-05 NA NA NA
Luckily, I do not need all of this data. I just want to match each eid with the value written in scientific notation, so:
eid sigma
1 1 4.17751e-005
2 2 4.01573e-005
3 3 3.72098e-005
This data goes on for hundreds of thousands of eids. For each record I need to discard the last three values of the first row, all of the values in the second row, and keep the second value (the one in scientific notation) in the third row, then place it next to the first value of the first row, and repeat. The column names other than eid are totally disposable, too. I've never had to deal with wrapped data before, so I don't know where to begin.
**edited to show df after read-in.
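A minimal sketch of one approach, assuming the wrapped table has been read into a data.frame df exactly as shown above, with every record spanning three consecutive rows (eid in the first column of row 1, sigma in the second column of row 3):
stopifnot(nrow(df) %% 3 == 0)        # every record must span exactly 3 rows
row1 <- seq(1, nrow(df), by = 3)     # rows holding eid
row3 <- seq(3, nrow(df), by = 3)     # rows holding sigma
result <- data.frame(eid = df[row1, 1], sigma = df[row3, 2])
head(result)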

Create dataframe with missing data

I'm very new to R, so please excuse my potentially noob question.
I have data from 23 individuals of hormone concentrations collected hourly. I've interpolated between hourly collections to get concentrations between 2.0 and 15.0 pg/ml at intervals of 0.1: this equals 131 rows of data per individual.
Some individuals' concentrations, however, don't go beyond 6.0 pg/ml (for example), which means I have data frames with unequal numbers of rows across individuals. I need all individuals to have 131 rows for the next step, where I combine all the data.
I've tried to create a data frame of NAs with 131 rows and two columns, and then add the individual's interpolated data into the NA data frame, so that the end result is a 131-row data frame with missing data as NA, but it's not going so well.
interp_saliva_002_x <- as.tibble(matrix(, nrow = 131, ncol = 1))
interp_sequence <- as.numeric(seq(2,15,.1))
interp_saliva_002_x[1] <- interp_sequence
colnames(interp_saliva_002_x)[1] <- "saliva_conc"
test <- left_join(interp_saliva_002_x, interp_saliva_002, by "saliva_conc")
Can you help me to understand where I'm going wrong or is there a more logical way to do this?
Thank you!
Let's assume you have 3 vectors with different lengths:
A<-seq(1,5); B<-seq(2,8); C<-seq(3,5)
Change the length of the vectors to the length that you want (in your case it's 131, I picked 7 for simplicity):
length(A)<-7; length(B)<-7; length(C)<-7 # this pads each vector to length 7, filling the new elements with NA
Next you can cbind the vectors to a matrix:
m <-cbind(A,B,C)
# A B C
#[1,] 1 2 3
#[2,] 2 3 4
#[3,] 3 4 5
#[4,] 4 5 NA
#[5,] 5 6 NA
#[6,] NA 7 NA
#[7,] NA 8 NA
You can also change your matrix to a dataframe:
df<-as.data.frame(m)
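Applied back to the original problem, a hedged sketch (the column name value is an assumption, picked for illustration): pad the individual's interpolated values out to 131 entries, then bind them to the full concentration sequence:
vals <- interp_saliva_002$value      # hypothetical column holding the interpolated data
length(vals) <- 131                  # pads with NA if this individual stopped early
interp_saliva_002_x <- data.frame(saliva_conc = seq(2, 15, 0.1),  # 131 concentrations
                                  value = vals)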

Impute NA values with previous value in R

We have the closing prices of 101 variables (companies). We have a lot of NA values (the stock market closes on Saturdays and Sundays, which gives NA values in our data) and we need to impute those NA values with the previous value where a previous value exists, but we don't succeed. This is our data example.
There are also companies that don't have data in the first years, since they were not yet on the stock market, so they have NA values for that period. And there are companies that go bankrupt and start having NA values, so these should both become 0.
How should we do this, given that we have several conditions for filling our NAs?
Thanks in advance.
My understanding of the rules is:
columns that are all NA are to be left as all NA
leading NA values are left as NA
interior NA values are replaced with the most recent non-NA values
trailing NA values are replaced with 0
To try this out we use the built-in data frame BOD replacing the 1st, 3rd and last rows with NA and adding a column of NA values -- see Note at end.
We define a logical vector ok having one element per column which is TRUE for columns having at least one element that is not NA and FALSE for other columns. Then operating only on the columns for which ok is TRUE we fill in the trailing NA values with 0 using na.fill. Then we use na.locf to fill in the interior NA values.
library(zoo)
ok <- !apply(is.na(BOD), 2, all)
BOD[, ok] <- na.locf(na.fill(BOD[, ok], c(NA, NA, 0)), na.rm = FALSE)
giving:
Time demand X
1 NA NA NA <-- leading NA values are left intact
2 2 10.3 NA
3 2 10.3 NA <-- interior NA values are filled in with last non-NA value
4 4 16.0 NA
5 5 15.6 NA
6 0 0.0 NA <-- trailing NA values are filled in with 0
Note
We used the following input above:
BOD[c(1, 3, 6), ] <- NA
BOD <- cbind(BOD, X = NA)
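For data shaped like the question's (one row per date, one column per company) the same two-step call works unchanged. A hedged sketch with made-up company columns:
library(zoo)
prices <- data.frame(AAA = c(NA, 100, NA, 102, NA),   # hypothetical closing prices
                     BBB = c(NA, NA, NA, NA, NA))     # never listed: stays all NA
ok <- !apply(is.na(prices), 2, all)
prices[, ok] <- na.locf(na.fill(prices[, ok], c(NA, NA, 0)), na.rm = FALSE)
prices
#   AAA BBB
# 1  NA  NA
# 2 100  NA
# 3 100  NA
# 4 102  NA
# 5   0  NA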

How to remove NA data in only one column?

I have a file that looks like so:
date A B
2014-01-01 2 3
2014-01-02 5 NA
2014-01-03 NA NA
2014-01-04 7 11
If I use newdata <- na.omit(data), where data is the above table loaded in R, then I get only two data points. I get that, since it filters all instances of NA. What I want is to filter for A and B separately, so that I get three data points for A and only two for B. Clearly, my real data set is much larger and the numbers are different, but neither should matter.
How can I achieve that?
Use is.na() on the relevant vector of data and index using the negated result. For example:
R> data[!is.na(data$A), ]
date A B
1 2014-01-01 2 3
2 2014-01-02 5 NA
4 2014-01-04 7 11
R> data[!is.na(data$B), ]
date A B
1 2014-01-01 2 3
4 2014-01-04 7 11
is.na() returns TRUE for every element that is NA and FALSE otherwise. To index the rows of the data frame, we can use this logical vector, but we want its converse. Hence we use ! to imply the opposite (TRUE becomes FALSE and vice versa).
You can restrict which columns you return by adding an index for the columns after the , in [ , ], e.g.
R> data[!is.na(data$A), 1:2]
date A
1 2014-01-01 2
2 2014-01-02 5
4 2014-01-04 7
Every column in a data frame must have the same number of elements; that is why NAs come in handy in the first place...
What you can do is
df.a <- df[!is.na(df$A), -3]
df.b <- df[!is.na(df$B), -2]
In Python (pandas), you can use the subset argument of dropna() to name the column(s) and inplace=True to make the change in the DataFrame itself:
rounds2.dropna(subset=['company_permalink'],inplace=True)
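For comparison, a hedged R analogue of that one-liner using tidyr; drop_na() keeps only the rows that are non-NA in the named columns:
library(tidyr)
data_a <- drop_na(data, A)   # rows where A is not NA; B may still be NA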

Subtracting time data exceeding 24h in R

My data consists of time points, in hours, starting from the start point of the experiment.
Experiments usually take over a week, so the number of hours easily exceeds 24.
To be precise, data is in the following format:
162:43:33.281
hhh:mm:ss.msecs
At the start of the experiment, data points could consist of just 1-2 digits for the hour instead of the 3 shown here.
When I try to subtract 2 time points I get an error stating that the numerical expression has, for example, 162:43 elements, which obviously refers to the colon used in the time annotation.
Any ideas on how to handle time variables whose hour values exceed 24?
I tried the strptime function with %H as the format, but that limits me to 24 hours.
Here is some example data:
V1 V2 V3 V4 V5
75:45:32.487 NA 17 ####revFalsePoke is 112 TRUE
75:45:32.487 NA 17 ####totalwindow is 5 TRUE
75:46:32.713 NA 1 ####Criteria not met TRUE
75:46:49.846 NA 6 ####revCorrectPoke is 37 TRUE
75:46:52.336 NA 9 ####revDeliberateLick is 34 TRUE
75:46:52.351 NA 9 ####totalwindow is 5 TRUE
75:46:52.598 NA 1 ####Criteria not met TRUE
75:47:21.332 NA 6 ####revCorrectPoke is 38 TRUE
75:47:23.440 NA 9 ####revDeliberateLick is 35 TRUE
75:47:23.455 NA 9 ####totalwindow is 6 TRUE
75:47:23.657 NA 1 ####rev Criteria not met TRUE
75:47:44.731 NA 17 ####revFalsePoke is 113 TRUE
75:47:44.731 NA 17 ####totalwindow is 6 TRUE
Unfortunately, you're going to have to roll your own converter function for this. I suggest converting the timestamps to difftime objects (which represent time duration, rather than location). You can then add them to some starting datetime to arrive at a final datetime for each timestamp. Here's one approach:
f <- function(start, timestep) {
  result <- mapply(function(part, units) as.difftime(as.numeric(part), units=units),
                   unlist(strsplit(timestep, ':')),
                   c('hours', 'mins', 'secs'),
                   SIMPLIFY=FALSE)
  start + do.call(sum, result)
}
start <- as.POSIXct('2013-1-1')
timesteps <- c('162:43:33.281', '172:34:28.33')
lapply(timesteps, f, start=start)
# [[1]]
# [1] "2013-01-07 18:43:33.280 EST"
#
# [[2]]
# [1] "2013-01-08 04:34:28.32 EST"
