How to work around NAs in subscripted assignments - r

I have some data; a representative subset is shown here:
id visitdate ecgday
5130 1999-09-22 1999-09-22
6618 NA 1999-12-01
10728 2000-06-27 2000-06-27
968 1999-04-19 1999-04-19
5729 1999-09-23 NA
1946 NA NA
15070 1999-11-09 NA
What I want is to create a new variable visitday which is equal to ecgday, unless ecgday is NA, in which case visitday should equal visitdate. If both visitdate and ecgday are NA, visitday should be NA.
I have tried
int99$visitday <- int99$visitdate
int99$visitday[!is.na(int99$ecgday) & int99$ecgday > int99$visitdate] <-
  int99$ecgday[!is.na(int99$ecgday) & int99$ecgday > int99$visitdate]
but it gave the error:
Error in [.data.frame(int99, , c("id", "visitday", "visitdate", :
undefined columns selected
which I understand. Any workaround to get the desired result?

This should do it.
First, if ecgday is NA, visitday will be visitdate; if not, it will be ecgday:
int99$visitday <- ifelse(is.na(int99$ecgday), int99$visitdate, int99$ecgday)
For cases when both have NAs, you can add a second ifelse:
int99$visitday <- ifelse(is.na(int99$visitdate), int99$ecgday, int99$visitdate)

Thanks to Derek Corcoran
That worked, except for one very small thing: visitday ended up being numeric, despite both ecgday and visitdate being Date.
That was easily fixed by adding a line:
int99$visitday <- ifelse(is.na(int99$ecgday), int99$visitdate, int99$ecgday)
int99$visitday <- ifelse(is.na(int99$visitdate), int99$ecgday, int99$visitdate)
int99$visitday <- as.Date(int99$visitday, origin="1970-01-01")
Thank You so much.
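As a footnote for later readers: dplyr::coalesce() expresses the same "ecgday unless NA, otherwise visitdate" rule in one call and preserves the Date class, so the as.Date() repair step is not needed. A minimal sketch, assuming the int99 data frame from the question (an alternative, not the accepted code):
library(dplyr)
# coalesce() returns the first non-NA value element-wise: ecgday wins when
# present, visitdate fills its NAs, and rows where both are NA stay NA.
# Unlike ifelse(), it keeps the Date class.
int99$visitday <- coalesce(int99$ecgday, int99$visitdate)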

In my view, the best way to deal with such NA comparisons is to convert the dates to numeric and all NAs to 0. Quite possibly I did not understand the question correctly, but in case you want to set the new variable to the later of visitdate and ecgday, you can try this.
It can also be adapted to other requirements.
int99 <- read.table(header = TRUE, colClasses = c("numeric", "Date", "Date"),
text="id visitdate ecgday
5130 1999-09-22 1999-09-22
6618 NA 1999-12-01
10728 2000-06-27 2000-06-27
968 1999-04-19 1999-04-19
5729 1999-09-23 NA
1946 NA NA
15070 1999-11-09 NA" )
# coerce the two Date columns to numeric day counts (apply() drops the Date class)
dt <- apply(int99[, 2:3], 2, zoo::as.Date)
dt
# recode NAs as 0 so they can never win the max() comparison
dt[is.na(dt)] <- 0
dt
# row-wise maximum, i.e. the later of the two dates
mx <- apply(dt, 1, max)
# rows where both dates were NA are still 0; turn them back into NA
mx[mx == 0] <- NA
# zoo::as.Date converts numerics without needing an origin argument
int99$visitday <- zoo::as.Date(mx)
int99
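A note on this approach: the detour through a numeric matrix and the 0 sentinel can be skipped, because base pmax() works directly on Date vectors. A minimal sketch under the same "later of the two dates" reading, assuming the int99 data frame defined above:
# element-wise maximum of the two dates; na.rm = TRUE ignores a single NA,
# and the result is NA only when both inputs are NA
int99$visitday <- pmax(int99$visitdate, int99$ecgday, na.rm = TRUE)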

Related

Override bad/wrong values in a main table with NA or null values listed on another lookup table in R

The main table is large and has certain undesired values that I want to override.
I am writing the keys and the new_value (NA) to override into a lookup table.
Both tables share 2 keys (session_id and datetime); neither is unique on its own.
Other similar questions go into replacing an NA with a value, but I want to replace a value with an NA, i.e. clear the cells' contents.
The 2 keys limit the use of match(), which can handle only one key and only first occurrences.
left_join or merge operations would create a new large dataframe with an added column, fill it with NA for every non-matching row, and also require some 'coalescing' into an NA value, which I guess doesn't exist.
I don't want to remove the entire row, as there are many other columns with their own values. I just want to delete the value from those cells.
I think that, in short, it is just an assignment operation to a filtered subset based on 2 keys. Something like:
table[ lookup_paired_keys(session_ids, lookup_datetimes) ] <- NA
Below is a sample dataset with undesired "0" values to replace with NA. The real dataset may contain other kinds of values.
table <- read.table(text = "
session_id datetime CaloriesDaily
1233815059 2016-05-01 5555
8583815123 2016-05-03 4444
8512315059 2016-05-04 2432
8583815059 2016-05-12 0
6290855005 2016-05-10 0
8253242879 2016-04-30 0
1503960366 2016-05-20 0
1583815059 2016-05-19 2343
8586545059 2016-05-20 1111
1290855005 2016-05-11 5425
1253242879 2016-04-25 1234
1111111111 2016-05-09 6542", header = TRUE)
table$datetime = as.POSIXct(table$datetime, tz='UTC')
table
lookup <- read.table(text = "
session_id datetime CaloriesDaily
8583815059 2016-05-12 NA
6290855005 2016-05-10 NA
8253242879 2016-04-30 NA
1503960366 2016-05-12 NA", header = TRUE)
lookup$datetime = as.POSIXct(lookup$datetime, tz='UTC')
lookup$CaloriesDaily = as.numeric(lookup$CaloriesDaily)
lookup
SOLVED
After reading the accepted answer, I want to share the final outcome.
Since my main table is a data.table, I got some warnings regarding nomenclature; be aware that I am no expert, but this is working with the example dataset and with my own.
lookup_by: a standard lookup operation
lookup_by <- function(table, lookup, by) {
  merge(table, lookup, by = by)
}
### usage ###
keys = c('session_id','datetime')
lookup_by( table, lookup, keys)
Adopted solution: match_by
Like match(), but with keys.
It returns vectors of row numbers for the rows where the keys match.
That way, an assignment like table[ ..matches.. ] <- NA is possible.
match_by <- function(table, lookup, by) {
  table <- setDT(table)[, ..by]
  table$idx1 <- 1:nrow(table)
  lookup <- setDT(lookup)[, ..by]
  lookup$idx2 <- 1:nrow(lookup)
  m <- merge(table, lookup, by = by)
  return(m[, c('idx1', 'idx2')])
}
### usage ###
keys = c('session_id','datetime')
rows = match_by( table, lookup, keys)
overrides <- c(lookup[ rows$idx2, 'CaloriesDaily' ])
table[ rows$idx1, 'CaloriesDaily' ] <- overrides
table
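As an aside, since the main table is already a data.table, the whole override can also be written as a single update join, assigning by reference without building index vectors first. A minimal sketch (an alternative formulation, not part of the adopted solution), assuming both objects have been converted with setDT():
library(data.table)
setDT(table); setDT(lookup)
# join on both keys; for the matching rows only, overwrite
# CaloriesDaily in place with NA
table[lookup, on = .(session_id, datetime), CaloriesDaily := NA_real_]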
Here’s a solution using dplyr::semi_join() and dplyr::anti_join() to split your dataframe based on whether the id and date keys match your lookup table. I then assign NAs in just the subset with matching keys, then row-bind the subsets back together. Note that this solution doesn’t preserve the original row order.
library(dplyr)
table_ok_vals <- table %>%
  anti_join(lookup, by = c("session_id", "datetime"))
table_replaced_vals <- table %>%
  semi_join(lookup, by = c("session_id", "datetime")) %>%
  mutate(CaloriesDaily = NA_real_)
table <- bind_rows(table_ok_vals, table_replaced_vals)
table
Output:
session_id datetime CaloriesDaily
1 1233815059 2016-05-01 5555
2 8583815123 2016-05-03 4444
3 8512315059 2016-05-04 2432
4 1503960366 2016-05-20 0
5 1583815059 2016-05-19 2343
6 8586545059 2016-05-20 1111
7 1290855005 2016-05-11 5425
8 1253242879 2016-04-25 1234
9 1111111111 2016-05-09 6542
10 8583815059 2016-05-12 NA
11 6290855005 2016-05-10 NA
12 8253242879 2016-04-30 NA

compute diffs of rows with NA values in a data frame using R

I have a data frame (9000 x 304), but it looks like this:
date         a         b
1997-01-01   8.720551  10.61597
1997-01-02   NA        NA
1997-01-03   8.774251  NA
1997-01-04   8.808079  11.09641
I want to calculate row differences such as:
first <- data[i-1,] - data[i-2,]
second <- data[i,] - data[i-1,]
third <- data[i,] - data[i-2,]
I want to ignore the NA values: if there is an NA, I want to use the last non-NA value in that column.
For example, in the second diff with i = 4, for column b:
11.09641 - 10.61597 is the value of b_diff on 1997-01-04
This is what I did, but it keeps generating data with NAs:
first <- NULL
for (i in 3:nrow(data)){
  first <- rbind(first, data[i-1,] - data[i-2,])
}
second <- NULL
for (i in 3:nrow(data)){
  second <- rbind(second, data[i,] - data[i-1,])
}
third <- NULL
for (i in 3:nrow(data)){
  third <- rbind(third, data[i,] - data[i-2,])
}
It might be solvable with the aggregate function, but I need a solution that can be applied to big data, and I can't specify each column name separately. Moreover, my column names are in a foreign language.
Thank you very much! I hope I gave you all the information you need to help me; otherwise, please let me know.
You can use fill to replace NAs with the closest previous value, and then use across and lag to compute the new variables. It is unclear what exactly your expected output is, but you can also replace the default value of lag when it does not exist (e.g. for the first value) using lag(.x, default = ...).
library(dplyr)
library(tidyr)
data %>%
  fill(a, b) %>%
  mutate(across(a:b, ~ lag(.x) - lag(.x, n = 2), .names = "first_{.col}"),
         across(a:b, ~ .x - lag(.x), .names = "second_{.col}"),
         across(a:b, ~ .x - lag(.x, n = 2), .names = "third_{.col}"))
date a b first_a first_b second_a second_b third_a third_b
1 1997-01-01 8.720551 10.61597 NA NA NA NA NA NA
2 1997-01-02 8.720551 10.61597 NA NA 0.000000 0.00000 NA NA
3 1997-01-03 8.774251 10.61597 0.0000 0 0.053700 0.00000 0.053700 0.00000
4 1997-01-04 8.808079 11.09641 0.0537 0 0.033828 0.48044 0.087528 0.48044
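Since the real data has 304 columns whose names cannot be listed by hand, the same pipeline can be written against all columns except date. A minimal sketch of that generalization (assuming, as the question implies, that every column other than date is numeric):
library(dplyr)
library(tidyr)
vars <- setdiff(names(data), "date")  # every column except the date
data %>%
  fill(all_of(vars)) %>%  # carry the last non-NA value down each column
  mutate(across(all_of(vars), ~ lag(.x) - lag(.x, n = 2), .names = "first_{.col}"),
         across(all_of(vars), ~ .x - lag(.x), .names = "second_{.col}"),
         across(all_of(vars), ~ .x - lag(.x, n = 2), .names = "third_{.col}"))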

Replacing expression within data.table

I'm running the following code below to retrieve a data set, which unfortunately uses "." instead of NA to represent missing data. After much wrangling and searching SO and other fora, I still cannot make the code replace all instances of "." with NA so I can convert the columns to numeric and go on with my life. I'm pretty sure the problem is between the screen and the chair, so I don't see a need to post sessionInfo, but please let me know otherwise. Help in solving this would be greatly appreciated. The first four columns are integers setting out the date and the unique ID, so I would only need to correct the other columns. Thanks in advance you all!
library(data.table)
google_mobility_data <- data.table(read.csv("https://github.com/OpportunityInsights/EconomicTracker/raw/main/data/Google Mobility - State - Daily.csv",stringsAsFactors = FALSE))
# The following line is the one where I can't make it work properly.
google_mobility_data[, .SD := as.numeric(sub("^\\.$", NA, .SD)), .SDcols = -c(1:4)]
I downloaded your data and changed the last entry on the first row to "." to test NA in the final column.
Use readLines to read a character vector.
Use gsub to change . to NA.
Use fread to read as a data.table.
library(data.table)
gmd <- readLines("Google Mobility - State - Daily.csv")
gmd[c(2,3120)]
# [1] "2020,2,24,1,.00286,-.00714,.0557,.06,.0129,.00857,."
# [2] "2020,4,25,10,-.384,-.191,.,-.479,-.441,.179,-.213"
gmd <- gsub(",\\.,",",NA,",gmd)
gmd <- gsub(",\\.$",",NA",gmd)
gmd[c(2,3120)]
# [1] "2020,2,24,1,.00286,-.00714,.0557,.06,.0129,.00857,NA"
# [2] "2020,4,25,10,-.384,-.191,NA,-.479,-.441,.179,-.213"
google_mobility_data <- fread(text=gmd)
google_mobility_data[c(1,3119)]
# year month day statefips gps_retail_and_recreation gps_grocery_and_pharmacy gps_parks gps_transit_stations gps_workplaces gps_residential gps_away_from_home
#1: 2020 2 24 1 0.00286 -0.00714 0.0557 0.060 0.0129 0.00857 NA
#2: 2020 4 25 10 -0.38400 -0.19100 NA -0.479 -0.4410 0.17900 -0.213
summary(google_mobility_data)
EDIT: You mentioned using na.strings with fread didn't work for you, so I suggested the above approach.
However, at least with the data file downloaded as I did, this worked in one line, as suggested by @MichaelChirico:
google_mobility_data <- fread("Google Mobility - State - Daily.csv",na.strings=".")
google_mobility_data[c(1,3119)]
# year month day statefips gps_retail_and_recreation gps_grocery_and_pharmacy gps_parks gps_transit_stations gps_workplaces gps_residential gps_away_from_home
#1: 2020 2 24 1 0.00286 -0.00714 0.0557 0.060 0.0129 0.00857 NA
#2: 2020 4 25 10 -0.38400 -0.19100 NA -0.479 -0.4410 0.17900 -0.213
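For completeness, the update-by-reference attempt from the question can also be repaired: .SD cannot be the assignment target, but a character vector of column names can. A minimal sketch of that fix (my own repair, not part of the answers above; it assumes the affected columns were read in as character):
library(data.table)
cols <- names(google_mobility_data)[-(1:4)]  # skip the four date/ID columns
# replace "." with NA, then coerce each column to numeric, by reference
google_mobility_data[, (cols) := lapply(.SD, function(x)
  as.numeric(replace(x, x == ".", NA))), .SDcols = cols]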

Want to select a column based on a return (within that same column) below a threshold in R

I have a return data frame (xts, zoo object) of size 1379 x 843. It should be read as date x security.
This is an example of the input:
BIIB.US.Equity JNJ.US.Equity BLUE.US.Equity BMRN.US.Equity AGN.US.Equity
2018-06-15 -0.5126407 0.001633853 -0.070558376 0.0846854857 -0.004426559
2018-06-18 -0.052158804 -0.310521165 -0.035226652 -0.0206967213 -0.008430535
2018-06-19 0.010099613 0.010303330 0.006510048 0.0004184976 0.007745167
2018-06-20 0.016504588 -0.004324060 0.029808774 0.0284459318 0.012366368
2018-06-21 0.001616924 -0.004834480 0.023211360 0.0009151922 -0.015411839
2018-06-22 -0.004136679 0.010374640 -0.065652522 0.0097023265 0.005322048
Now I would like to return a different list:
BIIB.US.Equity JNJ.US.Equity
2018-06-15 -0.5126407 0.001633853
2018-06-18 -0.052158804 -0.310521165
2018-06-19 0.010099613 0.010303330
2018-06-20 0.016504588 -0.004324060
2018-06-21 0.001616924 -0.004834480
2018-06-22 -0.004136679 0.010374640
As you can see, the second list only contains 2 columns, because there is a drop of 51% in the first security on 2018-06-15 and a drop of about 31% in the second security on 2018-06-18, both of which exceed the 30% threshold.
What I want is to get a new data frame from my current which selects securities where there is an instance of a 30% drop in security return or greater.
Currently I have tried:
df1 <- returns < -.3
returns[df1]
but this returns the error:
Error in `[.xts`(returns, df1) : 'i' or 'j' out of range
I have also tried this:
cls <- sapply(returns, function(c) any(c < -.3))
a<- returns[, cls, with = FALSE]
Yet that returns a matrix of the same size, only with a lot of NA values.
Is there something I am missing?
Basically, what I'd expect to get out is a data frame "df" of size 1379 x (something less than 843), where df keeps every column that contains an instance of a daily drop of -.3 or lower.
EDIT:
To those who have tried to help, thank you, but the output comes back as follows (I assigned the call to a):
> a
BIIB.US.Equity JNJ.US.Equity BLUE.US.Equity BMRN.US.Equity AGN.US.Equity PFE.US.Equity NBIX.US.Equity
> summary(a)
Index
Min. :NA
1st Qu.:NA
Median :NA
Mean :NA
3rd Qu.:NA
Max. :NA
> str(a)
An 'xts' object of zero-width
This should work:
df[, sapply(df, function(x) min(x, na.rm = TRUE) <= -0.3)]
OK, it should work this way, using the data.table package:
Let's try with an example data set that contains some NAs.
library(data.table)
set.seed(1)
x <- rnorm(10)*0.1
y <- x
z <- rnorm(10)+1
equities <- data.table(x,y,z)
equities[ sample(1:10,3), x:=NA]
equities[ sample(1:10,2), y:=NA]
equities[ sample(1:10,2), z:=NA]
print(equities)
x y z
1: -0.06264538 -0.06264538 NA
2: 0.01836433 0.01836433 1.3898432
3: -0.08356286 -0.08356286 0.3787594
4: 0.15952808 0.15952808 -1.2146999
5: 0.03295078 NA 2.1249309
6: NA NA 0.9550664
7: NA 0.04874291 0.9838097
8: 0.07383247 0.07383247 NA
9: NA 0.05757814 1.8212212
10: -0.03053884 -0.03053884 1.5939013
Choose the right columns, as described in Melissa's post:
myChoice <- sapply(equities, function(x) min(x, na.rm=T) <= -0.3)
And finally:
newequities <- equities[ , myChoice , with=F]
print(newequities)
z
1: NA
2: 1.3898432
3: 0.3787594
4: -1.2146999
5: 2.1249309
6: 0.9550664
7: 0.9838097
8: NA
9: 1.8212212
10: 1.5939013
As you didn't provide an input/output example, I am not sure if I understand you correctly, but try
df[colSums(df <= -0.3, na.rm = T) > 0]
Edit: added na.rm = T after the OP's update.
A slight addendum to this, because I came back to work today and noticed the answer wasn't quite what I wanted: if you want to select columns based on the row values, then the answer I marked correct is just missing one comma!
The current answer selects all the rows needed, but for the desired output described in the original post, you must use the command:
returns[, sapply(returns, function(x) min(x, na.rm = TRUE) <= -.3)]
notice the leading comma: with it, the logical vector selects columns rather than rows.
Hopefully this helps someone else at some point!
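To make the comma's role concrete, here is a tiny self-contained sketch with made-up data (an illustration, not taken from the question):
library(xts)
# two securities over three days; column a has a -50% day
m <- xts(cbind(a = c(-0.50, 0.01, 0.02),
               b = c( 0.01, 0.02, 0.03)),
         order.by = as.Date("2018-06-15") + 0:2)
keep <- apply(m, 2, function(x) min(x, na.rm = TRUE) <= -0.3)
keep       # a: TRUE, b: FALSE
m[, keep]  # with the leading comma, 'keep' selects columns: only a survives
           # m[keep] would instead treat 'keep' as a row index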

Selecting rows with same result in different columns in R

I would like to select in my dataframe (catch) only the rows for which my "tspp.name" variable is the same as my "elasmo.name" variable.
For example, row #74807 and #74809 in this case would be selected, but not row #74823 because the elasmo.name is "skate" and the tspp.name is "Northern shrimp".
I am sure there is an easy answer for this, but I have not found it yet. Any hints would be appreciated.
> catch[4:6,]
gear tripID obsID sortie setID date time NAFO lat long dur depth bodymesh
74807 GRL2 G00001 A 1 13 2000-01-04 13:40:00 2H 562550 594350 2.000000 377 80
74809 GRL2 G00001 A 1 14 2000-01-04 23:30:00 2H 562550 594350 2.166667 370 80
74823 GRL2 G00001 A 1 16 2000-01-05 07:45:00 2H 561450 593050 3.000000 408 80
codendmesh mail.fil long.fil nbr.fil hook.shape hook.size hooks VTS tspp tspp.name elasmo
74807 45 NA NA NA NA NA 3.3 2211 Northern shrimp 2211
74809 45 NA NA NA NA NA 3.2 2211 Northern shrimp 2211
74823 45 NA NA NA NA NA 3.3 2211 Northern shrimp 211
elasmo.name kept discard Tcatch date.1 latitude longitude EID
74807 Northern shrimp 2747 50 2797 2000-01-04 56.91667 -60.21667 G00001-13
74809 Northern shrimp 4919 100 5019 2000-01-04 56.91667 -60.21667 G00001-14
74823 Skates 0 50 50 2000-01-05 56.73333 -60.00000 G00001-16
fgear
74807 Shrimp trawl (stern) with a grid
74809 Shrimp trawl (stern) with a grid
74823 Shrimp trawl (stern) with a grid
I know what the problem is - you need to read in the data "as is", by adding the argument as.is=TRUE to the read.csv command (which you presumably used to load everything in). Without this, the strings get stored as factors, and all methods suggested above will fail (as you've discovered!)
Once you've read in the data correctly, you can use either
catch[which(catch$tspp.name == catch$elasmo.name),]
or
subset(catch, tspp.name == elasmo.name)
to obtain the matching rows - do not omit the which in the first one, otherwise the code will fail when doing comparisons with NAs.
Below is a 30-second example using a small fabricated data set that illustrates all these points explicitly.
First, create a text file on disk that looks like this (I saved it as "F:/test.dat" but it can be saved anywhere)...
col1~col2
a~b
a~a
b~b
c~NA
NA~d
NA~NA
Let's first load it in the default way, leaving the strings as factors, just to see the methods proposed above fall over:
> dat=read.csv("F:/test.dat",sep="~") # don't forget to check the filename
> dat[which(dat$col1==dat$col2),]
Error in Ops.factor(dat$col1, dat$col2) : level sets of factors are different
> dat[dat$col1==dat$col2,]
Error in Ops.factor(dat$col1, dat$col2) : level sets of factors are different
> subset(dat,col1==col2)
Error in Ops.factor(col1, col2) : level sets of factors are different
This is exactly the problem you were having. If you type dat$col1 and dat$col2 you'll see that the first has factor levels a b c while the second has factor levels a b d - hence the error messages.
Now let's do the same, but this time reading in the data "as is":
> dat=read.csv("F:/test.dat",sep="~",as.is=TRUE) # note the as.is=TRUE
> dat[which(dat$col1==dat$col2),]
col1 col2
2 a a
3 b b
> dat[dat$col1==dat$col2,]
col1 col2
2 a a
3 b b
NA <NA> <NA>
NA.1 <NA> <NA>
NA.2 <NA> <NA>
> subset(dat,col1==col2)
col1 col2
2 a a
3 b b
As you can see, the first method (based on which) and the third method (based on subset) both give the right answer, while the second method gets confused by comparisons with NA. I would personally advocate the subset method as in my opinion it's the neatest.
A final note: There are other ways that you can get strings arising as factors in a data frame - and to avoid all of those headaches, always remember to include the argument stringsAsFactors = FALSE at the end whenever you create a data frame using data.frame. For instance, the correct way to create the object dat directly in R would be:
dat=data.frame(col1=c("a","a","b","c",NA,NA), col2=c("b","a","b",NA,"d",NA),
stringsAsFactors=FALSE)
Type dat$col1 and dat$col2 and you'll see they've been interpreted correctly. If you try it again but with the stringsAsFactors argument omitted (or set to TRUE), you'll see those darned factors appear (just like the dodgy first method of loading from disk).
In short, always remember as.is=TRUE and stringsAsFactors=FALSE, and learn how to use the subset command, and you won't go far wrong!
Hope this helps :)
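One footnote for readers on current R (my addition, not part of the original answer): since R 4.0.0 the default for stringsAsFactors changed to FALSE, so read.csv and data.frame keep strings as character out of the box, and the as.is / stringsAsFactors flags are only needed on older versions.
# On R >= 4.0.0 this reads col1/col2 as character without extra arguments,
# and the NA-safe subset() comparison just works
dat <- read.csv("F:/test.dat", sep = "~")
subset(dat, col1 == col2)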
