I'm trying to decrypt the following table:
Date Label Code
26.09.2018 a 41310001075389700
27.09.2018 a 448160001075586000
02.10.2018 a 576990001074818000
28.09.2018 a 32270001075371700
11.10.2018 a 511660001074989852
01.10.2018 a 188260001074810000
09.10.2018 a 395980001075290000
10.10.2018 a 461080001075350000
11.09.2018 a 119400001074791000
13.09.2018 a 451710001075704000
17.09.2018 a 245950001074796000
18.09.2018 a 260001074888965
20.09.2018 a 390150001074855000
24.09.2018 a 558580001074794000
25.09.2018 a 322670001074798000
11.10.2018 a 285750001075053852
19.09.2018 a 15400001074929400
03.10.2018 a 550850001074820000
28.09.2018 a 359980001075372000
27.09.2018 b 445000000272901000
11.10.2018 b 86250000272927000
10.09.2018 b 490632000272892000
11.09.2018 b 234130000272888574
26.09.2018 b 007910000273087757
28.09.2018 b 459100000272797000
12.09.2018 b 085370000272864511
17.09.2018 b 80150000272953600
18.09.2018 b 120860000273659000
01.10.2018 b 243850000272906000
04.10.2018 b 315990000272946000
02.10.2018 b 54630000272868600
08.10.2018 b 649470000272938000
13.09.2018 b 514820000272867584
02.10.2018 b 14390000273446600
29.09.2018 b 177190000272714000
05.10.2018 b 423250000272924000
10.10.2018 b 613380000272892000
The most I can make out is the central part of each "Code", which seems tied to the "Label": e.g. "a" : "******107******" and "b" : "*****27*****". If anyone has an idea, I'd be very happy.
What needs to be decrypted is the link between "Label" and "Code".
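For what it's worth, that label/substring observation is easy to check in R (a quick sketch; the vectors below are just the first few codes retyped from the table):
# First few codes for each label, copied from the table above
a_codes <- c("41310001075389700", "448160001075586000", "576990001074818000")
b_codes <- c("445000000272901000", "86250000272927000", "490632000272892000")
# Every "a" code contains "00107" and every "b" code contains "00027"
all(grepl("00107", a_codes))  # TRUE
all(grepl("00027", b_codes))  # TRUE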
I want to round these values, but they span several orders of magnitude, so I can't apply a single rule like round(pvalue, 2). How do I accomplish this?
id <- LETTERS[1:10]
pvalue <- c(0.3,0.0432,0.0032,0.67,0.00000003,0.0069,0.782, 0.0004, 0.00076,0.341)
df <- data.frame(id,pvalue)
df
id pvalue
1 A 0.30000000
2 B 0.04320000
3 C 0.00320000
4 D 0.67000000
5 E 0.00000003
6 F 0.00690000
7 G 0.78200000
8 H 0.00040000
9 I 0.00076000
10 J 0.34100000
It should look like:
id pvalue
1 A 0.3
2 B 0.04
3 C 0.003
4 D 0.67
5 E <0.0001
6 F 0.007
7 G 0.78
8 H 0.0004
9 I 0.0007
10 J 0.34
I think you're using the wrong tool. If you want to prepare p-values for scientific display, you can use the pvalString function in lazyWeave to convert your numeric values into correctly formatted strings.
library(lazyWeave)
pvalue <- c(0.3,0.0432,0.0032,0.67,0.00000003,0.0069,0.782, 0.0004, 0.00076,0.341)
pvalString(pvalue)
[1] "0.30" "0.043" "0.003" "0.67" "< 0.001" "0.007" "0.78" "< 0.001" "< 0.001" "0.34"
You can edit the parameters to get exactly what you want, but the default settings give you the standard convention.
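If you'd rather stay in base R, format.pval gets you close, though you would need to tune digits and eps yourself (a sketch, not an exact match for the output above):
# Base R alternative: values below eps are printed as "< eps",
# everything else is formatted to roughly `digits` significant digits
format.pval(pvalue, digits = 2, eps = 0.001)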
# example
a <- data.frame(name=c("A","B","C"), KW=c(201902,201904,201905),price=c(1.99,3.02,5.00))
b <- data.frame(KW=c(201903,201904,201904),price=c(1.98,3.00,5.00),name=c("a","b","c"))
I want to match a and b with fuzzy logic, using the variables KW and price, allowing a tolerance of +/- 1 for KW and +/- 0.02 for price.
The desired outcome should look like this:
name.x KW.x price.x KW.y price.y name.y
1 A 201902 1.99 201903 1.98 a
2 B 201904 3.02 201904 3.00 b
3 C 201905 5.00 201904 5.00 c
I would prefer a solution using the fuzzyjoin package. So far I have tried the fuzzy_inner_join function, specifying my desired tolerances for KW and price via the match_fun argument, but I couldn't get it to work.
Any help on how to solve this would be appreciated.
You can create a Cartesian product of the two data frames with merge(..., by = NULL) and then subset the rows that satisfy the required tolerances.
subset(merge(a, b, by = NULL), abs(KW.x - KW.y) <= 1 &
         abs(price.x - price.y) <= 0.02)
# name.x KW.x price.x KW.y price.y name.y
#1 A 201902 1.99 201903 1.98 a
#5 B 201904 3.02 201904 3.00 b
#9 C 201905 5.00 201904 5.00 c
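Since the question specifically asked for fuzzyjoin, here is a sketch of how fuzzy_inner_join can take one match function per join column (the tolerance functions simply mirror the +/- 1 and +/- 0.02 stated above):
library(fuzzyjoin)
fuzzy_inner_join(
  a, b,
  by = c("KW", "price"),
  match_fun = list(
    function(x, y) abs(x - y) <= 1,    # KW tolerance: +/- 1
    function(x, y) abs(x - y) <= 0.02  # price tolerance: +/- 0.02
  )
)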
I have the following dataset:
Name<-c('A','A','B','C','B','C','D','B','C','A','D','C','B','C','A','D','C','B','A','D','C','B')
Rate<-c(12,13,4,8,7,3,6,8,5,4,7,5,9,4,7,2,7,3,9,13,14,12)
Date<-c('1998-11-11', '1992-12-01','2010-06-17', '2001-10-3','2019-4-01', '2020-4-23','2021-2-01', '1995-12-01',
'1994-7-11', '2023-3-01','2022-06-17', '1982-10-3','1898-4-01', '2027-4-23','1927-2-01', '2028-12-01',
'1993-5-21', '2013-2-09','2020-01-17', '1987-4-3','1881-5-01', '2024-5-23')
df <- cbind.data.frame(Name, Rate, Date)
df
Name Rate Date
1 A 12 1998-11-11
2 A 13 1992-12-01
3 B 4 2010-06-17
4 C 8 2001-10-3
5 B 7 2019-4-01
6 C 3 2020-4-23
7 D 6 2021-2-01
8 B 8 1995-12-01
9 C 5 1994-7-11
10 A 4 2023-3-01
11 D 7 2022-06-17
12 C 5 1982-10-3
13 B 9 1898-4-01
14 C 4 2027-4-23
15 A 7 1927-2-01
16 D 2 2028-12-01
17 C 7 1993-5-21
18 B 3 2013-2-09
19 A 9 2020-01-17
20 D 13 1987-4-3
21 C 14 1881-5-01
22 B 12 2024-5-23
I want to write a function in R to do the following:
Find the standard deviation of Rate for each Name (A, B, C, D) using historical data only, where historical data means any record with Date < Dec'2018; future records must not enter the SD calculation. Then add each Name's historical SD to the future Rates of that Name, where future Rates are those with Date > Dec'2018. Could anyone please help me write this function?
Below is the (not yet working) attempt I have so far; note it needs zoo for as.yearmon and dplyr for mutate:
library(zoo)
library(dplyr)
with(mutate(df, timediff = as.yearmon(Date) - as.yearmon(Sys.Date())),
     tapply(df$Rate, Name, function(x) {
       ifelse(timediff < 0,
              x + sd(x),
              x)
     }, simplify = FALSE))
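One way to express the intended logic with dplyr, assuming "Dec'2018" means a cutoff of 2018-12-01 (the cutoff date is an assumption you may need to adjust):
library(dplyr)
cutoff <- as.Date("2018-12-01")  # assumed boundary for "Dec'2018"
df$Date <- as.Date(df$Date)
df %>%
  group_by(Name) %>%
  mutate(
    hist_sd  = sd(Rate[Date < cutoff]),  # SD from historical records only
    # shift only the future rates; Names with fewer than two
    # historical records yield NA, since sd() of one value is NA
    adj_rate = ifelse(Date >= cutoff, Rate + hist_sd, Rate)
  ) %>%
  ungroup()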
I have a large data frame and I want to calculate the correlation coefficient between hot and index, by class:
ID hot index class
41400 10 2 a
41400 12 2 a
41400 75 4 a
41401 89 5 a
41401 25 3 c
41401 100 6 c
20445 67 4 c
20445 89 6 c
20445 4 1 c
20443 67 5 d
20443 120.2 7 a
20443 140.5 8 d
20423 170.5 10 d
20423 78.1 5 c
Intended output (the numbers are placeholders):
a = 0.X
c = 0.Y
d = 0.Z
I know I should be able to use the by command, but I can't get it to work.
Code
cor_eqn = function(df){
m = cor(hot ~ index, df);
}
by(df,df$class,cor_eqn,simplify = TRUE)
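For reference, a corrected sketch of that attempt: cor() expects two numeric vectors rather than a formula, so the helper can be written as:
# by() splits df on class and applies the function to each piece
cor_eqn <- function(d) cor(d$hot, d$index)
by(df, df$class, cor_eqn)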
Another option is to use a data.table instead of a data.frame. You can just call setDT(df) on your existing data.frame (I created a data.table initially below):
library(data.table)
##
set.seed(123)
DT <- data.table(
ID=1:50000,
class=rep(
letters[1:4],
each=12500),
hot=rnorm(50000),
index=rgamma(50000,shape=2))
## set key on the grouping column for better
## performance with a large data set
setkeyv(DT, "class")
##
> DT[, list(Correlation = cor(hot, index)), by = class]
class Correlation
1: a 0.005658200
2: b 0.001651747
3: c -0.002147164
4: d -0.006248392
You can use dplyr for this:
library(dplyr)
gp <- group_by(df, class)
correl <- dplyr::summarise(gp, correl = cor(hot, index))
print(correl)
# class correl
# a 0.9815492
# c 0.9753372
# d 0.9924337
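The same computation in the now more common pipe style (behaviorally identical; summarise stays qualified in case plyr is also loaded):
df %>%
  group_by(class) %>%
  dplyr::summarise(correl = cor(hot, index))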
Note that class and df are also the names of base R functions; names like these can cause trouble.
I am having a terrible time running 'ddply' over two variables in what seems like it should be a simple command.
Sample data (df):
Brand Day Rev RVP
A 1 2535.00 195.00
B 1 1785.45 43.55
C 1 1730.87 32.66
A 2 920.00 230.00
B 2 248.22 48.99
C 3 16466.00 189.00
A 1 2535.00 195.00
B 3 1785.45 43.55
C 3 1730.87 32.66
A 4 920.00 230.00
B 5 248.22 48.99
C 4 16466.00 189.00
I am using the command:
library(plyr)
df2 <- ddply(df, .(Brand, Day), summarize, Rev = mean(Rev), RVP = sum(RVP))
My dataframe has about 2600 observations, and there are 45 levels of "Brand" and up to 300 levels of "Day" (which is coded using 'difftime').
I am able to easily use 'ddply' when simply grouping by "Day," but when I also try to group by "Brand," my computer freezes up.
Thoughts?
You should read through the help pages for aggregate, by, ave, and tapply, paying close attention to the types of arguments each one expects and to the argument names, and then run all of their examples. The main thing @hadley did with pkg:plyr and reshape/reshape2 was to impose some degree of regularity on these interfaces, at the expense of speed. I understand why he did it, especially when I try to use the base::reshape function, and also because I repeatedly forget which of these requires a list, which requires the FUN= argument label, and which needs interaction() for the grouping variable, since they are all somewhat different.
> aggregate(df[3:4], df[1:2], mean)
Brand Day Rev RVP
1 A 1 2535.000 195.00
2 B 1 1785.450 43.55
3 C 1 1730.870 32.66
4 A 2 920.000 230.00
5 B 2 248.220 48.99
6 B 3 1785.450 43.55
7 C 3 9098.435 110.83
8 A 4 920.000 230.00
9 C 4 16466.000 189.00
10 B 5 248.220 48.99
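Note that the question asked for mean(Rev) but sum(RVP), while aggregate applies one function to every measure column. A sketch of one base R workaround is to aggregate each column with its own function and merge the results:
# Apply a different summary to each column, then join on the keys
rev_mean <- aggregate(Rev ~ Brand + Day, data = df, FUN = mean)
rvp_sum  <- aggregate(RVP ~ Brand + Day, data = df, FUN = sum)
merge(rev_mean, rvp_sum, by = c("Brand", "Day"))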