Yes, I'm doing a CTF, but don't worry, I'm not going to get first place either way. I'm just trying to figure out this problem that's been driving me nuts. Here is the coded text:
1143 4423 1553 5321 3111 2253 5344 2311 4414 5215 3131 4324 3344 2315 2315 1142 4434 2115 5115 4254 3211 3300
I know it's not ASCII, hexadecimal, or octal, and it's not a hash like SHA-1 or MD5. From the looks of it, only the digits 0-5 are used, so could it be base 6?
Is it base 6, and if so, how does one convert this to text?
This is senary (also known as base 6, heximal, or seximal). See the Wikipedia article on senary.
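If you want to poke at that interpretation, here is a minimal sketch in R (assuming each 4-digit group is a base-6 number; how the resulting numbers map to letters is the part of the puzzle left to you):

# split the ciphertext into groups and convert each one from base 6 to decimal
cipher <- "1143 4423 1553 5321 3111 2253 5344 2311 4414 5215 3131 4324 3344 2315 2315 1142 4434 2115 5115 4254 3211 3300"
groups <- strsplit(cipher, " ")[[1]]
strtoi(groups, base = 6L)  # e.g. "1143" in base 6 is 279 in decimal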
Like the title says, is there any way to get an exact number from a Holt-Winters forecast? For example, say I have a time series like this:
Date Total
6/1/2014 150
7/1/2014 219
8/1/2014 214
9/1/2014 47
10/1/2014 311
11/1/2014 198
12/1/2014 169
1/1/2015 253
2/1/2015 167
3/1/2015 262
4/1/2015 290
5/1/2015 319
6/1/2015 405
7/1/2015 395
8/1/2015 391
9/1/2015 345
10/1/2015 401
11/1/2015 390
12/1/2015 417
1/1/2016 375
2/1/2016 397
3/1/2016 802
4/1/2016 466
After storing it in the variable hp, I used Holt-Winters to make a forecast:
hp.ts <- ts(hp$Total, frequency = 12, start = c(2014,4))
hp.ts.hw <- HoltWinters(hp.ts)
library(forecast)
hp.ts.hw.fc <- forecast.HoltWinters(hp.ts.hw, h = 5)
plot(hp.ts.hw.fc)
However, what I need to know is what exactly the Total in 2016/05 is predicted to be. Is there any way to get the exact value?
By the way, I noticed that the blue (forecast) line is NOT connected to the black line. Is that normal, or should I fix my code?
Thank you for reading.
I don't know why you went the long way around when you have already loaded library(forecast). Below are direct answers to your questions:
hp.ts.hw.fc <- hw(hp.ts, h = 5)  # hw() fits the model and forecasts in one step
plot(hp.ts.hw.fc)
hp.ts.hw.fc
Point Forecast Lo 80 Hi 80 Lo 95 Hi 95
Mar 2016 546.5311 448.2997 644.7624 396.2992 696.7630
Apr 2016 623.7030 525.4716 721.9344 473.4711 773.9349
May 2016 671.8989 573.6675 770.1303 521.6670 822.1309
Jun 2016 667.3722 569.1408 765.6036 517.1402 817.6041
Jul 2016 500.0710 401.8396 598.3024 349.8390 650.3030
I'm not sure I understood your question, but you can get the forecasted values with:
hp.ts.hw.fc$mean
You can use the accuracy() function to measure how good your results are.
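If you want the single predicted value for a specific month such as 2016/05, here is a minimal sketch (assuming that month falls within the forecast horizon h):

# extract just the point forecast for May 2016 from the mean series
window(hp.ts.hw.fc$mean, start = c(2016, 5), end = c(2016, 5))

As for the gap you noticed: that is normal. The forecast series begins one step after the last observation, so the plot does not join the blue line back to the black one.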
OK, so I have a .txt file with names followed by their respective phone numbers, and I need to grab all the numbers following the ###-###-#### syntax, which I have accomplished with this code:
grep -E "([0-9]{3})-[0-9]{3}-[0-9]{4}" telephonefile_P2
but my problem is that there are instances of
(###)-###-####
(###) ### ####
### ### ####
###-####
This is the file:
Sam Spade (212)-756-1045
Daffy Duck 312 450 2856
Mom 354-2015
Star Club 49 040–31 77 78 0
Lolita Spengler (816) 756 8657
Hoffman's Kleider 049 37 1836 027
Dr. Harold Kranzler 765-986-9987
Ralph Spoilsport's Motors 967 882 6534
Hermann's Speilhaus 49 25 8377 1765
Hal Kubrick 44 1289 332934
Sister Sue 978 0672
Auggie Keller 49 089/594 393
JCCC 913-469-8500
This is my desired output:
Sam Spade (212)-756-1045
Daffy Duck 312 450 2856
Mom 354-2015
Lolita Spengler (816) 756 8657
Dr. Harold Kranzler 765-986-9987
Ralph Spoilsport's Motors 967 882 6534
Sister Sue 978 0672
JCCC 913-469-8500
and I don't know how to account for these alternate forms...
I'm obviously new to Unix, so please be gentle!
The pattern below matches an optional three-digit area code (with or without parentheses) followed by a dash or space, then three digits, a dash or space, and four digits, all anchored to whitespace or a line boundary so that the longer international numbers fall through:
$ awk '/(^|[[:space:]])(\(?[0-9]{3}\)?[- ])?[0-9]{3}[- ][0-9]{4}([[:space:]]|$)/' file
Sam Spade (212)-756-1045
Daffy Duck 312 450 2856
Mom 354-2015
Lolita Spengler (816) 756 8657
Dr. Harold Kranzler 765-986-9987
Ralph Spoilsport's Motors 967 882 6534
Sister Sue 978 0672
JCCC 913-469-8500
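Since you started from grep, the same extended regular expression should work there as well (a sketch, not tested beyond the sample file above):

$ grep -E '(^|[[:space:]])(\(?[0-9]{3}\)?[- ])?[0-9]{3}[- ][0-9]{4}([[:space:]]|$)' telephonefile_P2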
I have 2 datasets with more than 100K rows each. I would like to merge them based on fuzzy string matching of one column ('movie title'), as well as using the release date. I am providing a sample from both datasets below.
dataset-1
itemid userid rating time title release_date
99991 1673 835 3 1998-03-27 mirage 1995
99992 1674 840 4 1998-03-29 mamma roma 1962
99993 1675 851 3 1998-01-08 sunchaser, the 1996
99994 1676 851 2 1997-10-01 war at home, the 1996
99995 1677 854 3 1997-12-22 sweet nothing 1995
99996 1678 863 1 1998-03-07 mat' i syn 1997
99997 1679 863 3 1998-03-07 b. monkey 1998
99998 1680 863 2 1998-03-07 sliding doors 1998
99999 1681 896 3 1998-02-11 you so crazy 1994
100000 1682 916 3 1997-11-29 scream of stone (schrei aus stein) 1991
dataset-2
itemid userid rating time title release_date
1 2844 4477 3 2013-03-09 fantã´mas - 〠l'ombre de la guillotine 1913
2 4936 8871 4 2013-05-05 the bank 1915
3 4936 11628 3 2013-07-06 the bank 1915
4 4972 16885 4 2013-08-19 the birth of a nation 1915
5 5078 11628 2 2013-08-23 the cheat 1915
6 6684 4222 3 2013-08-24 the fireman 1916
7 6689 4222 3 2013-08-24 the floorwalker 1916
8 7264 2092 4 2013-03-17 the rink 1916
9 7264 5943 3 2013-05-12 the rink 1916
10 7880 11628 4 2013-07-19 easy street 1917
I have looked at 'agrep', but it only matches one string at a time. The 'stringdist' function is good, but you need to run it in a loop, find the minimum distance, and then go on to further processing, which is very time consuming given the size of the datasets. The strings can have typos and special characters, which is why fuzzy matching is required. I have looked around and found the 'Levenshtein' and 'Jaro-Winkler' methods. The latter, I read, is good when you have typos in strings.
In this scenario, fuzzy matching alone may not give good results: e.g., the movie title 'toy story' in one dataset could be matched to 'toy story 2' in the other, which is not right. So I need to consider the release date to make sure the matched movies are unique.
I want to know if there is a way to achieve this task without using a loop. Worst case, if I do have to use a loop, how can I make it work as efficiently and as fast as possible?
I have tried the following code but it has taken an awful amount of time to process.
library(RecordLinkage)  # provides jarowinkler()

for (i in 1:nrow(test)) {
  for (j in 1:nrow(test1)) {
    test$title.match[i] <- ifelse(jarowinkler(test$title[i], test1$title[j]) > 0.85,
                                  test$title[i], NA)
  }
}
test - contains 1682 unique movie names converted to lower case
test1 - contains 11451 unique movie names converted to lower case
Is there a way to avoid the for loops and make it work faster?
What about this approach to move you forward? You can adjust the degree of match from 0.85 after you see the results. You could then use dplyr to group by the matched title and summarise by subtracting release dates. Any zeros would mean the same release date.
dataset1$title.match <- ifelse(jarowinkler(dataset1$title, dataset2$title) > 0.85,
                               dataset1$title, NA)
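If you would rather avoid the explicit double loop entirely, here is a sketch using the stringdist package mentioned in the question: stringdistmatrix() computes all pairwise Jaro-Winkler distances in one vectorized call. titles1 and titles2 stand in for your two vectors of lower-cased titles, and the 0.15 cutoff is an assumption you would tune:

library(stringdist)

# all pairwise Jaro-Winkler distances (0 = identical) in a single call;
# for 1682 x 11451 titles this is one numeric matrix of roughly 150 MB
d <- stringdistmatrix(titles1, titles2, method = "jw", p = 0.1)

# for each title in the first set, locate its closest match in the second
best <- apply(d, 1, which.min)
matched <- data.frame(title1 = titles1,
                      title2 = titles2[best],
                      dist   = d[cbind(seq_along(best), best)])

# keep only close matches, then compare release dates to rule out
# pairs like 'toy story' vs 'toy story 2'
matched <- matched[matched$dist < 0.15, ]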
I have the data frame new1 with 20 columns of variables, one of which is new1$year. This includes 25 years with the following counts:
> table(new1$year)
1988 1989 1990 1991 1992 1993 1994 1995 1996 1997 1998 1999 2000 2001 2002 2003 2004 2005 2006 2007 2008 2009 2010 2011 2012
2770 3171 3392 2955 2906 2801 2930 2985 3181 3059 2977 2884 3039 2428 2653 2522 2558 2370 2666 3046 3155 3047 2941 2591 1580
I tried to prepare a histogram of this with
hist(new1$year, breaks=25)
but I obtain a histogram where the height of the columns differs from the numbers in table(new1$year). For example, the first column is >4000 in the histogram while it should be 2770; another example is 1995, where there should be a bar lower than those of the surrounding years, but it is instead a little higher.
What am I doing wrong? I have tried to define numeric(new1$year) (the error says 'invalid length argument'), but with no different result.
Many thanks
Marco
Per my comment, try:
barplot(table(new1$year))
The reason hist does not work exactly as you intend has to do with the specification of the breaks argument. See ?hist:
one of:
a vector giving the breakpoints between histogram cells,
a function to compute the vector of breakpoints,
a single number giving the number of cells for the histogram,
a character string naming an algorithm to compute the number of cells (see ‘Details’),
a function to compute the number of cells.
In the last three cases the number is a suggestion only.
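If you do want hist() to draw exactly one bar per year, pass an explicit vector of breakpoints; per the excerpt above, a breakpoint vector (unlike a single number) is honored exactly. A minimal sketch, assuming the years run from 1988 to 2012:

# half-integer breakpoints put each year into its own bin
hist(new1$year, breaks = seq(1987.5, 2012.5, by = 1))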
I have an interesting conundrum. I can create the type of chart I seek interactively, but not automatically. Or, I nearly had it automatically, but something broke. (example data at end of post).
I have my loop working the way I would like, but have run into errors when I add some geom_vline() statements (for us, denoting significant changes in our production environment). I've tried working through it outside of the loop and am able to recreate the issue with details below.
I have the following steps:
create a vector with the list of changes:
changeVector <- c(as.Date("2011-11-30"),as.Date("2011-12-05"))
[WORKS] create a plot with the data below, and it works:
ggplot(df,aes(x=OBSDATE,y=AVG_RESP))+geom_line(aes(group=REGION,color=REGION))
[WORKS] try to add the geom_vline(xintercept=c(15308,15313)), and it works (but only if the geom_vline is at the end):
ggplot(df,aes(x=OBSDATE,y=AVG_RESP))+geom_line(aes(group=REGION,color=REGION))+geom_vline(xintercept=c(15308,15313))
[FAIL] try to add the geom_vline(xintercept=changeVector) - I had problems with this for some reason, and had to add as.numeric to recognize the vector values properly:
ggplot(df,aes(x=OBSDATE,y=AVG_RESP))+geom_vline(xintercept=as.numeric(changeVector))+geom_line(aes(group=REGION,color=REGION))
When this step runs, I get the wonderfully useful error message:
Error: Non-continuous variable supplied to scale_x_continuous.
So, any ideas? If I try to add an aesthetic component to the geom_vline, I still make no progress. My desire was to have the geom_vline preceding the geom_line because the vline is context, not data.
Thank you for your help!
Here is a subset of the data (dataFile name df):
OBSDATE REGION COUNT AVG_RESP
2011-11-29 EMEA 293 4.430375
2011-11-30 EMEA 299 4.802876
2011-12-01 EMEA 292 4.362363
2011-12-02 EMEA 293 4.209829
2011-12-03 EMEA 294 4.262959
2011-12-04 EMEA 294 4.207959
2011-12-05 EMEA 293 4.172594
2011-12-06 EMEA 293 4.230887
2011-12-07 EMEA 298 4.259329
2011-12-08 EMEA 293 4.197645
2011-11-29 Americas 296 2.841182
2011-11-30 Americas 296 2.932196
2011-12-01 Americas 292 2.766438
2011-12-02 Americas 293 2.819556
2011-12-03 Americas 291 2.710584
2011-12-04 Americas 295 2.728407
2011-12-05 Americas 290 2.764310
2011-12-06 Americas 290 2.817483
2011-12-07 Americas 295 2.733864
2011-12-08 Americas 291 2.732405
2011-11-29 APAC 328 7.294024
2011-11-30 APAC 325 7.091046
2011-12-01 APAC 314 6.969236
2011-12-02 APAC 327 6.920428
2011-12-03 APAC 325 7.226308
2011-12-04 APAC 324 7.046296
2011-12-05 APAC 318 7.075094
2011-12-06 APAC 317 7.016467
2011-12-07 APAC 318 7.187358
2011-12-08 APAC 318 7.310220
I'm not exactly sure why it is doing that, but here is a workaround that keeps the vertical lines behind the data lines:
ggplot(df,aes(x=OBSDATE,y=AVG_RESP)) +
geom_blank() +
geom_vline(xintercept=as.numeric(changeVector)) +
geom_line(aes(group=REGION,color=REGION))
EDIT:
Here is another workaround: explicitly specify that the x axis is to be a date, rather than have ggplot guess. When it guesses, it looks at the first layer plotted, which here is the vertical lines. Given that the xintercept values have to be supplied as numbers rather than dates, the x axis is assumed to be continuous/numeric. When the next layer is drawn, the dates cannot be mapped onto that axis and an error is thrown.
ggplot(df,aes(x=OBSDATE,y=AVG_RESP)) +
geom_vline(xintercept=as.numeric(changeVector)) +
geom_line(aes(group=REGION,color=REGION)) +
scale_x_date()
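For reference, the magic numbers in the question are simply those same change dates converted to day counts:

as.numeric(as.Date(c("2011-11-30", "2011-12-05")))
# [1] 15308 15313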