So I've been trying to create a stock chart as part of a project while learning R. Now I'd like to do the same with indexed values, so I want to create a vector of indexed values for each of my stocks. I tried the following:
indeksih <- apply(kombo, huhtamaki, FUN = huhtamaki/huhtamaki[1])
However, this gives me:
Error in Ops.data.frame(huhtamaki, huhtamaki[1]) :
‘/’ only defined for equally-sized data frames
This is what my data looks like:
head(kombo)
Date Huhtamaki Sampo Kone
1 2019-12-30 41.38 38.91 58.28
2 2019-12-27 41.84 39.07 59.14
3 2019-12-23 41.66 39.13 59.02
4 2019-12-20 41.57 39.22 59.06
5 2019-12-19 40.69 38.99 58.32
6 2019-12-18 40.74 38.41 57.68
We can use
indexksi <- kombo$Huhtamaki/kombo$Huhtamaki[1]
Simply divide the column by its first element:
kombo[,"Huhtamaki"]/kombo[1, "Huhtamaki"]
If you want to do it on many columns, a data.table approach can be useful:
library(data.table)
setDT(kombo)
kombo[, lapply(.SD, function(x) x/x[1]), .SDcols = setdiff(names(kombo), "Date")]
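For reference, the same indexing of every stock column can also be done in base R on the plain data.frame (a sketch; run it before setDT() converts kombo):
# divide every price column by its own first value
indexed <- kombo
indexed[-1] <- lapply(kombo[-1], function(x) x / x[1])
head(indexed)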
I have a data set that has some duplicate records. For those records, most of the column values are the same, but a few ones are different.
I need to identify the columns where the values are different, and then subset those columns.
This would be a sample of my dataset:
library(data.table)
dat <- "ID location date status observationID observationRep observationVal latitude longitude setSource
FJX8KL loc1 2018-11-17 open 445 1 17.6 -52.7 -48.2 XF47
FJX8KL loc2 2018-11-17 open 445 2 1.9 -52.7 -48.2 LT12"
dat <- setDT(read.table(textConnection(dat), header=T))
And this is the output I would expect:
observationRep observationVal setSource
1: 1 17.6 XF47
2: 2 1.9 LT12
One detail is: my original dataset has 189 columns, so I need to check all of them.
How can I achieve this?
Two issues: first, use the text= argument rather than textConnection; second, use as.data.table rather than setDT, since setDT modifies an existing object in place and here the object doesn't exist yet.
dat1 <- data.table::as.data.table(read.table(text=dat, header=TRUE))
dat1[, c('observationRep', 'observationVal', 'setSource')]
# observationRep observationVal setSource
# 1: 1 17.6 XF47
# 2: 2 1.9 LT12
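With 189 columns you probably don't want to list the names by hand. One possible sketch (assuming dat1 contains only the duplicated records, as in the sample) keeps whichever columns contain more than one distinct value:
# keep only the columns whose values differ across the duplicate rows
diff_cols <- names(dat1)[sapply(dat1, function(x) length(unique(x)) > 1)]
dat1[, ..diff_cols]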
The main table is large.
It has certain undesired values that I want to override.
I am writing the keys and the new value (NA) to override into a lookup table.
Both have two keys (session_id and datetime), not a single unique key.
Other similar questions go into replacing an NA with a value, but I want to replace a value with an NA, i.e. clear the cells' contents.
The two keys limit the use of match(), which handles only one key and only first occurrences.
left_join or merge operations would create a new, large data frame with an added column, fill it with NA for non-matching rows, and would also require some 'coalescing' into an NA value, which I guess doesn't exist.
I don't want to remove the entire row, as there are many other columns with their own values. I just want to delete the value from those cells.
I think that, in short, it is just an assignment to a subset filtered on the two keys. Something like:
table[ lookup_paired_keys(session_ids, lookup_datetimes) ] <- NA
Below is a sample dataset with undesired 0 values to replace with NA. The real dataset may contain other kinds of values.
table <- read.table(text = "
session_id datetime CaloriesDaily
1233815059 2016-05-01 5555
8583815123 2016-05-03 4444
8512315059 2016-05-04 2432
8583815059 2016-05-12 0
6290855005 2016-05-10 0
8253242879 2016-04-30 0
1503960366 2016-05-20 0
1583815059 2016-05-19 2343
8586545059 2016-05-20 1111
1290855005 2016-05-11 5425
1253242879 2016-04-25 1234
1111111111 2016-05-09 6542", header = TRUE)
table$datetime = as.POSIXct(table$datetime, tz='UTC')
table
lookup <- read.table(text = "
session_id datetime CaloriesDaily
8583815059 2016-05-12 NA
6290855005 2016-05-10 NA
8253242879 2016-04-30 NA
1503960366 2016-05-12 NA", header = TRUE)
lookup$datetime = as.POSIXct(lookup$datetime, tz='UTC')
lookup$CaloriesDaily = as.numeric(lookup$CaloriesDaily)
lookup
SOLVED
After reading the accepted answer, I want to share the final outcome.
Since my main table is a data.table, I got some warnings regarding nomenclature; be aware that I am no expert, but this works with the example dataset here and with my own.
lookup_by: standard lookup operation
lookup_by <- function(table, lookup, by) {
  merge(table, lookup, by = by)
}
### usage ###
keys = c('session_id','datetime')
lookup_by( table, lookup, keys)
Adopted solution: match_by
Like match() but with keys.
It returns vectors of row numbers where the keys match,
so that an assignment like table[ ..matches.. ] <- NA is possible.
match_by <- function(table, lookup, by) {
  table <- setDT(table)[, ..by]
  table$idx1 <- 1:nrow(table)
  lookup <- setDT(lookup)[, ..by]
  lookup$idx2 <- 1:nrow(lookup)
  m <- merge(table, lookup, by = by)
  return(m[, c('idx1', 'idx2')])
}
### usage ###
keys = c('session_id','datetime')
rows = match_by( table, lookup, keys)
overrides <- c(lookup[ rows$idx2, 'CaloriesDaily' ])
table[ rows$idx1, 'CaloriesDaily' ] <- overrides
table
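For reference, the same "set matched cells to NA" operation can also be written as a single data.table update join (a sketch using the same two key columns):
library(data.table)
setDT(table)
# rows of table that match lookup on both keys get CaloriesDaily overwritten with NA, by reference
table[lookup, on = c("session_id", "datetime"), CaloriesDaily := NA]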
Here’s a solution using dplyr::semi_join() and dplyr::anti_join() to split your dataframe based on whether the id and date keys match your lookup table. I then assign NAs in just the subset with matching keys, then row-bind the subsets back together. Note that this solution doesn’t preserve the original row order.
library(dplyr)
table_ok_vals <- table %>%
  anti_join(lookup, by = c("session_id", "datetime"))
table_replaced_vals <- table %>%
  semi_join(lookup, by = c("session_id", "datetime")) %>%
  mutate(CaloriesDaily = NA_real_)
table <- bind_rows(table_ok_vals, table_replaced_vals)
table
Output:
session_id datetime CaloriesDaily
1 1233815059 2016-05-01 5555
2 8583815123 2016-05-03 4444
3 8512315059 2016-05-04 2432
4 1503960366 2016-05-20 0
5 1583815059 2016-05-19 2343
6 8586545059 2016-05-20 1111
7 1290855005 2016-05-11 5425
8 1253242879 2016-04-25 1234
9 1111111111 2016-05-09 6542
10 8583815059 2016-05-12 NA
11 6290855005 2016-05-10 NA
12 8253242879 2016-04-30 NA
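If preserving the original row order matters, a mutate()-based variant on the same keys avoids the split-and-rebind (a sketch; pasting the two key columns together is a quick composite-key trick):
library(dplyr)
table <- table %>%
  mutate(CaloriesDaily = replace(
    CaloriesDaily,
    paste(session_id, datetime) %in% paste(lookup$session_id, lookup$datetime),
    NA
  ))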
I have two data tables: claims and SC. They have one column in common, subCoverageKey. In claims I want to create a new column, subCoverageKeyClaims. For every row in claims I want to take the corresponding subCoverageKeyClaims value from SC, i.e. matching by subCoverageKey. In case there are multiple subCoverageKeyClaims values for a subCoverageKey, a random choice should be taken.
How can this be done?
I tried using sample() but couldn't get it to work.
The resulting data.table should look something like this:
claims
clientID claimID claimYear amount clDate subCoverageKey subCoverageKeyClaims
1: 1 OP_a19517b1-5c66-47ca-92de-40c1b1a0b16b 2019 50.01 2019-04-26 IP_accommodation b83f2a41-64c3-4571-97e7-6534f9629104
2: 1 OP_a19517b1-5c66-47ca-92de-40c1b1a0b16b 2019 50.01 2019-04-26 IP_bundle f0a9ee55e-31b1-46f8-a0d4-91154e6c0998
3: 1 OP_a19517b1-5c66-47ca-92de-40c1b1a0b16b 2019 50.01 2019-04-26 IP_accommodation f0a9ee55e-31b1-46f8-a0d4-91154e6c0998
4: 1 OP_064c03aa-f2d5-4768-9c4e-51b54a725e56 2019 78.25 2019-06-09 IP_upgrade 74390be79-dc1e-4f7a-a0c0-f548c0b9ffcb
5: 1 OP_064c03aa-f2d5-4768-9c4e-51b54a725e56 2019 78.25 2019-06-09 Daily_cash 7a61bcf3-9e6d-4c4b-be2b-1381527dedd6
---
2637586: 130999 OP_b165c233-cd77-461b-b37d-704ac647d878 2019 8.66 2019-09-13 IP_upgrade ffdef3f3-2996-4d1a-bf51-a78b43029079
2637587: 130999 OP_0a11b09d-fd4c-427e-ad7b-8c67c2fa70e5 2019 61.16 2019-09-17 Daily_cash 0a9ee55e-31b1-46f8-a0d4-91154e6c0998
2637588: 131000 OP_3fb03980-8642-48bf-8967-55e410243868 2019 12.64 2019-05-10 IP_upgrade 4390be79-dc1e-4f7a-a0c0-f548c0b9ffcb
2637589: 131000 OP_64d85cc6-db73-408a-a02a-6b0c811ee06d 2019 8.44 2019-05-02 IP_bundle ffdef3f3-2996-4d1a-bf51-a78b43029079
2637590: 131000 OP_8b5585d8-d8e0-47ed-9005-3584062d4103 2019 3.57 2019-03-10 IP_accommodation ffdef3f3-2996-4d1a-bf51-a78b43029079
The data.tables I am planning to join are quite large, with ~300,000 observations, so I am looking for something that won't take too much time.
I found the answer I was looking for:
Convert the SC table into a list:
listSC <- split(SC, SC$subCoverageKey)
Use sapply to randomly pick the corresponding subCoverageKeyClaims:
claims$subCoverageKeyClaims <- sapply(claims$subCoverageKey, function(x){
  sample(listSC[[x]]$subCoverageKeyClaims, 1)
})
Here is one tidyverse way:
library(tidyverse)
SC %>%
  select(claimID, subCoverageKey) %>%
  group_by(subCoverageKey) %>%
  nest() %>%
  ungroup() %>%
  right_join(claims, by = 'subCoverageKey') %>%
  mutate(subCoverageKeyClaims = map_chr(data, ~sample(.x$claimID, 1)))
Since there are duplicates in the claims dataset, we create a list column for each subCoverageKey in SC and join it with the claims dataset. We can then select one random claimID value from each row's respective data.
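If speed becomes an issue at this size, a data.table sketch (assuming, as in the answer above, that SC holds claimID and subCoverageKey) draws all samples for one key in a single grouped operation:
library(data.table)
setDT(claims)
setDT(SC)
claims[, subCoverageKeyClaims := {
  # pool of candidate claim IDs in SC for this subCoverageKey
  pool <- SC[subCoverageKey == .BY$subCoverageKey, claimID]
  # draw one candidate per claims row in this group
  pool[sample.int(length(pool), .N, replace = TRUE)]
}, by = subCoverageKey]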
I've got SQL output in a data.frame which looks like this:
dateTime resultMean SensorDescription
1 2009-01-09 21:35:00 7.134589 Aanderaa Optode - Type 3835
2 2009-01-09 21:35:00 7.813000 Seabird SBE45 Thermosalinograph
3 2009-01-09 21:35:00 8.080399 Turner SCUFA II Chlorophyll Fluorometer
4 2009-01-09 21:35:00 7.818604 ADAM PT100 PRT
5 2009-01-09 21:36:00 7.818604 ADAM PT100 PRT
I want to turn it into a frame like so:
dateTime Aanderaa Optode - Type 3835 Seabird SBE45 Thermosalinograph Turner SCUFA II Chlorophyll Fluorometer ADAM PT100 PRT
1 2009-01-09 21:35:00 7.134589 7.813000 8.080399 7.818604
Currently I've got a function which splits by SensorDescription, then loops over the list with merge.
Is there a better way of doing this using built-in functions? I've looked at plyr, ddply etc. and nothing seems to do quite what I want.
The current merging loop function looks like this:
listmerge = function(datalist){
  mdat = datalist[[1]][1:2]
  for(i in 2:length(datalist)){
    mdat = join(mdat, datalist[[i]][1:2], by = "dateTime", match = "all")
  }
  mdat
}
You can use dcast from the reshape2 package:
library(reshape2)
d <- data.frame(x=1, y=letters[1:10], z=runif(10))
dcast(x ~ y, data=d)
Using z as value column: use value.var to override.
x a b c d e f g h i j
1 1 0.7582016 0.4000201 0.5712599 0.9851774 0.9971331 0.2955978 0.9895403 0.6114973 0.323996 0.785073
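Applied to the data frame from the question (here assumed to be called sensors), the call would look something like:
# one row per dateTime, one column per sensor, values taken from resultMean
dcast(sensors, dateTime ~ SensorDescription, value.var = "resultMean")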
reshape from the base stats package can also accomplish this, but the syntax is a little more difficult.
reshape(d, idvar='x', timevar='y', direction='wide')
x z.a z.b z.c z.d z.e z.f z.g z.h z.i z.j
1 1 0.7582016 0.4000201 0.5712599 0.9851774 0.9971331 0.2955978 0.9895403 0.6114973 0.323996 0.785073
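On newer installations, tidyr::pivot_wider() does the same reshape with arguably clearer syntax (again assuming the question's data frame is called sensors):
library(tidyr)
pivot_wider(sensors, names_from = SensorDescription, values_from = resultMean)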
I would like to select in my dataframe (catch) only the rows for which my "tspp.name" variable is the same as my "elasmo.name" variable.
For example, rows #74807 and #74809 in this case would be selected, but not row #74823, because the elasmo.name is "skate" and the tspp.name is "Northern shrimp".
I am sure there is an easy answer for this, but I have not found it yet. Any hints would be appreciated.
> catch[4:6,]
gear tripID obsID sortie setID date time NAFO lat long dur depth bodymesh
74807 GRL2 G00001 A 1 13 2000-01-04 13:40:00 2H 562550 594350 2.000000 377 80
74809 GRL2 G00001 A 1 14 2000-01-04 23:30:00 2H 562550 594350 2.166667 370 80
74823 GRL2 G00001 A 1 16 2000-01-05 07:45:00 2H 561450 593050 3.000000 408 80
codendmesh mail.fil long.fil nbr.fil hook.shape hook.size hooks VTS tspp tspp.name elasmo
74807 45 NA NA NA NA NA 3.3 2211 Northern shrimp 2211
74809 45 NA NA NA NA NA 3.2 2211 Northern shrimp 2211
74823 45 NA NA NA NA NA 3.3 2211 Northern shrimp 211
elasmo.name kept discard Tcatch date.1 latitude longitude EID
74807 Northern shrimp 2747 50 2797 2000-01-04 56.91667 -60.21667 G00001-13
74809 Northern shrimp 4919 100 5019 2000-01-04 56.91667 -60.21667 G00001-14
74823 Skates 0 50 50 2000-01-05 56.73333 -60.00000 G00001-16
fgear
74807 Shrimp trawl (stern) with a grid
74809 Shrimp trawl (stern) with a grid
74823 Shrimp trawl (stern) with a grid
I know what the problem is - you need to read in the data "as is", by adding the argument as.is=TRUE to the read.csv command (which you presumably used to load everything in). Without this, the strings get stored as factors, and all methods suggested above will fail (as you've discovered!)
Once you've read in the data correctly, you can use either
catch[which(catch$tspp.name == catch$elasmo.name),]
or
subset(catch, tspp.name == elasmo.name)
to obtain the matching rows - do not omit the which in the first one, otherwise you will get unwanted rows of NAs wherever the comparison involves an NA.
Below is a 30-second example using a small fabricated data set that illustrates all these points explicitly.
First, create a text file on disk that looks like this (I saved it as "F:/test.dat" but it can be saved anywhere)...
col1~col2
a~b
a~a
b~b
c~NA
NA~d
NA~NA
Let's load it in the default way (strings stored as factors), just to see the methods proposed above fall over:
> dat=read.csv("F:/test.dat",sep="~") # don't forget to check the filename
> dat[which(dat$col1==dat$col2),]
Error in Ops.factor(dat$col1, dat$col2) : level sets of factors are different
> dat[dat$col1==dat$col2,]
Error in Ops.factor(dat$col1, dat$col2) : level sets of factors are different
> subset(dat,col1==col2)
Error in Ops.factor(col1, col2) : level sets of factors are different
This is exactly the problem you were having. If you type dat$col1 and dat$col2 you'll see that the first has factor levels a b c while the second has factor levels a b d - hence the error messages.
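If re-reading the file isn't convenient, you can also sidestep the level-set problem by comparing the factors as character strings:
# convert both factors to character before comparing
dat[which(as.character(dat$col1) == as.character(dat$col2)), ]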
Now let's do the same, but this time reading in the data "as is":
> dat=read.csv("F:/test.dat",sep="~",as.is=TRUE) # note the as.is=TRUE
> dat[which(dat$col1==dat$col2),]
col1 col2
2 a a
3 b b
> dat[dat$col1==dat$col2,]
col1 col2
2 a a
3 b b
NA <NA> <NA>
NA.1 <NA> <NA>
NA.2 <NA> <NA>
> subset(dat,col1==col2)
col1 col2
2 a a
3 b b
As you can see, the first method (based on which) and the third method (based on subset) both give the right answer, while the second method gets confused by comparisons with NA. I would personally advocate the subset method as in my opinion it's the neatest.
A final note: There are other ways that you can get strings arising as factors in a data frame - and to avoid all of those headaches, always remember to include the argument stringsAsFactors = FALSE at the end whenever you create a data frame using data.frame. For instance, the correct way to create the object dat directly in R would be:
dat=data.frame(col1=c("a","a","b","c",NA,NA), col2=c("b","a","b",NA,"d",NA),
stringsAsFactors=FALSE)
Type dat$col1 and dat$col2 and you'll see they've been interpreted correctly. If you try it again but with the stringsAsFactors argument omitted (or set to TRUE), you'll see those darned factors appear (just like the dodgy first method of loading from disk).
In short, always remember as.is=TRUE and stringsAsFactors=FALSE, and learn how to use the subset command, and you won't go far wrong!
Hope this helps :)