I have an old piece of software that uses some kind of database that saves its data in TXB/TZB files. The data in those files looks like this:
1911 ¸£
1913 ¼£
1916 ְ£
1921 ִ£
1922 ָ£
1923 ּ£
1924 װ£
1925 ה£
1926 ט£
1929 ל£
1930 פ£
1931 £
1932 ₪
1933 ₪
1934 ,₪
1935 <₪
1936 h₪
1937 x₪
1938 €₪
1939 ”₪
Does anyone know this extension? How can I decode the data? It displays the data partially (the years in the example above), but around that data there are some unrecognized characters.
Lately I ran into a problem that I have spent quite a long time trying to figure out, without success. I want to use the pgmm function from the plm package to produce GMM estimates on cross-country panel data covering 180 countries and 65 time periods. Here is my code:
pgmm(D_rcr ~ lag(D_rcr, 1) +
       eco_cycle + I(log(Human_trend)) +
       I(log(capital_trend)) + I(log(rtfpna)) + exp_rate + urban +
       industry + service | plm::lag(mpk3_delta, 3:6),
     data = data_test, index = c("country", "year"),
     effect = "twoways", transformation = "ld")
And the data looks like this:
country year D_rcr eco_cycle Human_trend capital_trend rtfpna exp_rate urban industry service
1000 Burkina Faso 1999 0.0074201618 0.0295545705 4.644064 23946.998 0.8284378 -8.221149e-06 17.166 25.19151 42.19550
1001 Burkina Faso 2000 -0.0046062428 -0.0085762554 4.781708 25026.203 0.8177401 -8.013943e-06 17.844 21.52736 45.66413
1002 Burkina Faso 2001 -0.0074698958 -0.0022468581 4.942214 26203.394 0.8430429 -4.433730e-06 18.540 19.47667 43.47496
1003 Burkina Faso 2002 -0.0072339948 -0.0180040290 5.102502 27513.395 0.8564266 -4.243651e-06 19.258 17.52184 43.92530
1004 Burkina Faso 2003 0.0208224248 -0.0013267292 5.262760 28994.841 0.8928111 -4.900598e-06 19.996 21.18051 41.74380
1005 Burkina Faso 2004 0.0077643394 -0.0164384391 5.424015 30686.577 0.9057222 -5.039807e-06 20.757 21.17522 44.30414
1006 Burkina Faso 2005 -0.0162568441 0.0079704026 5.588694 32625.279 0.9540021 -6.000714e-06 21.537 17.97970 42.98950
1007 Burkina Faso 2006 0.0157383040 0.0101814490 5.759905 34843.140 0.9746150 -6.004488e-06 22.339 17.62221 45.65378
1008 Burkina Faso 2007 0.0200791048 -0.0074020766 5.940701 37366.725 0.9920313 -5.925001e-06 23.163 18.95747 48.37608
1009 Burkina Faso 2008 -0.0329526715 -0.0083514921 6.134060 40213.051 1.0026470 -6.737820e-06 23.993 16.22030 43.57783
1010 Burkina Faso 2009 0.0108550329 -0.0364106100 6.341043 43385.046 0.9904070 -5.581967e-06 24.828 19.32466 45.10181
1011 Burkina Faso 2010 0.0003792556 -0.0105232997 6.561223 46865.511 1.0080181 -3.757856e-06 25.665 23.00269 41.38044
1012 Burkina Faso 2011 0.0036570272 -0.0008078762 6.808363 50612.776 1.0000000 -1.947466e-06 26.505 27.15270 39.00203
1013 Burkina Faso 2012 -0.0133615481 0.0088997716 7.066275 54562.488 0.9885733 -4.819380e-06 27.346 24.91152 40.03299
1014 Burkina Faso 2013 -0.0124167169 0.0180233629 7.332700 58635.384 0.9726224 -4.963807e-06 28.186 20.99917 43.38722
1015 Burkina Faso 2014 -0.0093625110 0.0183559642 7.605543 62756.115 0.9531422 -2.547616e-06 29.024 20.47991 44.29216
1016 Burundi 1980 -0.0076063659 -0.0518049023 2.122768 4760.103 0.8508636 -4.026274e-05 4.339 12.61903 25.13108
1017 Burundi 1981 0.0062886770 0.0123692536 2.204532 5003.674 0.9142978 -2.922222e-05 4.503 13.41068 25.27003
1018 Burundi 1982 -0.0073451957 -0.0326804079 2.286727 5257.374 0.8792791 -3.623259e-05 4.674 15.44782 27.69609
1019 Burundi 1983 -0.0048256924 -0.0422295228 2.369051 5513.472 0.8658228 -3.508869e-05 4.850 15.50037 27.25349
1020 Burundi 1984 -0.0083655241 -0.0846945313 2.450960 5763.198 0.8221024 -3.652248e-05 5.033 13.84020 26.02882
1021 Burundi 1985 0.0062427433 -0.0081527185 2.531672 5997.500 0.8820450 -2.695840e-05 5.221 13.00197 25.46080
1022 Burundi 1986 0.0085330972 -0.0050508112 2.610223 6208.165 0.8880036 -2.122831e-05 5.417 13.51567 27.95960
1023 Burundi 1987 0.0013978951 0.0048612717 2.685521 6388.418 0.8895743 -2.379938e-05 5.620 17.12692 27.76170
1024 Burundi 1988 0.0120351151 0.0273960217 2.756473 6533.799 0.9023147 -1.564180e-05 5.830 16.66728 29.08768
1025 Burundi 1989 -0.0040708811 0.0237706740 2.822166 6643.176 0.8862884 -1.854892e-05 6.047 19.66485 26.65657
1026 Burundi 1990 0.0031577402 0.0461310277 2.882089 6718.144 0.9082311 -2.401352e-05 6.271 18.96324 25.15806
1027 Burundi 1991 0.0053723287 0.0913896512 2.944615 6763.149 0.9525304 -2.121296e-05 6.455 19.59612 26.09317
1028 Burundi 1992 -0.0006242234 0.1118378705 3.002747 6784.165 0.9633930 -2.326152e-05 6.637 21.17273 25.29381
1029 Burundi 1993 -0.0140939288 0.0566603249 3.058046 6787.775 0.8787442 -2.704195e-05 6.823 22.44800 24.93331
1030 Burundi 1994 -0.0051914045 0.0446976884 3.112606 6780.856 0.8391036 -2.185189e-05 7.014 22.47806 30.74383
1031 Burundi 1995 -0.0101237974 -0.0044596262 3.168960 6770.888 0.7683206 -2.265241e-05 7.211 19.24821 32.60719
1032 Burundi 1996 -0.0120905700 -0.0733896784 3.230019 6765.241 0.6975604 -1.365663e-05 7.412 12.63033 30.14807
1033 Burundi 1997 -0.0018359105 -0.0465258303 3.298963 6770.493 0.6909751 -9.966291e-06 7.618 15.62753 36.64545
1034 Burundi 1998 0.0010393142 0.0151532770 3.379006 6792.311 0.7113460 -1.557693e-05 7.830 15.84338 36.12517
1035 Burundi 1999 -0.0087320046 0.0148961511 3.473067 6835.203 0.6845708 -1.100638e-05 8.036 16.20578 35.90429
1036 Burundi 2000 -0.0036065995 0.0062503238 3.583363 6902.337 0.6644523 -1.597629e-05 8.246 16.93214 35.00821
1037 Burundi 2001 -0.0015909534 0.0154145281 3.715491 6997.577 0.6616966 -1.724226e-05 8.461 16.49441 37.06893
1038 Burundi 2002 0.0000723279 0.0336641982 3.865710 7124.685 0.6701715 -1.645656e-05 8.682 16.69844 37.51544
1039 Burundi 2003 -0.0081492568 -0.0184660371 4.033378 7286.589 0.6383393 -1.518514e-05 8.908 17.03472 36.60505
1040 Burundi 2004 -0.0053202507 -0.0321812288 4.217188 7484.465 0.6333758 -1.595708e-05 9.139 17.70286 36.85237
1041 Burundi 2005 -0.0073075515 -0.1193560863 4.415421 7718.514 0.5999591 -2.871464e-05 9.375 18.45308 37.05039
1042 Burundi 2006 0.0953559671 -0.1570653899 4.626188 7989.549 0.6067449 -3.880530e-05 9.617 16.71110 38.94535
1043 Burundi 2007 0.0094422379 -0.1988202659 4.847617 8298.925 0.6223342 -3.338575e-05 9.864 18.03837 44.62553
1044 Burundi 2008 0.0217479374 -0.1709468848 5.077926 8647.895 0.6832167 -3.469164e-05 10.118 15.98312 43.42593
1045 Burundi 2009 0.0516023614 -0.0108370357 5.315478 9036.445 0.8509285 -2.165262e-05 10.376 16.63140 42.83632
1046 Burundi 2010 0.0128353068 0.0302280799 5.558862 9461.720 0.9318257 -1.805687e-05 10.641 16.70423 42.84727
1047 Burundi 2011 0.0164302478 0.0549153842 5.820920 9917.799 1.0000000 -1.674949e-05 10.912 16.89924 42.75338
1048 Burundi 2012 0.0177947374 0.0834531229 6.088302 10396.854 1.0781934 -1.448994e-05 11.189 16.88556 42.53204
1049 Burundi 2013 -0.0094655827 0.0443990294 6.360665 10890.133 1.0653062 -1.141332e-05 11.472 17.73397 42.43866
1050 Burundi 2014 -0.0061952542 0.0121767175 6.637850 11389.026 1.0576664 -1.038206e-05 11.761 18.31099 42.42737
The error is
Error in solve.default(crossprod(WX, t(crossprod(WX, A1)))) :
Lapack routine dgesv: system is exactly singular: U[10,10] = 0
And sometimes, after adjustments such as data_test <- dplyr::filter(data_test, !is.na(rtfpna)), the error becomes:
Error in solve.default(A1) :
system is computationally singular: reciprocal condition number = 1.14054e-16
or
Error in solve.default(crossprod(WX, t(crossprod(WX, A2)))) :
system is computationally singular: reciprocal condition number = 1.69599e-24
My guess is that 1) pgmm cannot handle an unbalanced data frame as well as plm does, especially when the data contains about 10% NA values, and 2) the solve function has no fallback for inverting a matrix when an eigenvalue is too small. Also, according to a colleague who works mainly in Stata, Stata does not have this problem. So my questions are: how can I fix this problem, and is my code heading in the right direction?
Any suggestions would be helpful.
Judging from the data you provided, this could be the cause of the error: your dataset is not balanced. It seems that your data for Burundi starts in 1980, while Burkina Faso starts in 1999.
I had the same error. In my dataset I had the years 1891 to 1899, but 1892 was missing - I had forgotten to clean the data so that it was balanced. When I removed 1891, the problem was solved.
Intuitively this makes sense: system GMM uses higher-order lags to instrument the first lag. However, if years are randomly missing, this obviously does not work consistently.
Of course, it could also be that some of your explanatory variables are highly correlated.
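For illustration, here is one way to force a balanced panel before calling pgmm(). This is only a sketch; data_test and the column names are taken from the question, and you may prefer a different balancing rule (plm also provides make.pbalanced(), which is worth a look):
library(dplyr)

# Drop rows with a missing regressor, then keep only the countries that are
# observed in every remaining year, so that pgmm() receives a balanced panel.
complete_data <- filter(data_test, !is.na(rtfpna))

all_years <- sort(unique(complete_data$year))

balanced_data <- complete_data %>%
  group_by(country) %>%
  filter(all(all_years %in% year)) %>%
  ungroup()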
I am working with daily temperature data that I have already run through R to pull out the first and last days of each year that are above a calculated threshold unique to each city dataset.
Data is brought into R in a .csv file with columns "YR", "number_of_days", "start_date", and "end_date". I only care about the "start_date" and "end_date" columns for this calculation.
For example, if I am looking at heat extremes, the first day of the year to have a temperature above 33 degrees C is May 1st and the last day of the year to have a temperature above 33 degrees C is October 20th. I do not care what the temperatures of the days in between are, just the start and end dates.
I want to convert "May 1st" to an absolute day number so it can be compared to other years. Below is sample data from the BakersfieldTMAXextremes data frame:
YR number_of_days start_date end_date
1900 27 5/22/00 10/18/00
1901 42 6/29/01 10/22/01
1902 76 6/7/02 9/23/02
1903 97 5/6/03 10/18/03
1904 98 4/8/04 9/15/04
1905 115 5/11/05 10/10/05
1906 90 4/20/06 10/27/06
1907 97 5/27/07 10/10/07
1908 107 4/11/08 9/16/08
1909 106 5/2/09 9/23/09
1910 89 4/18/10 10/15/10
1911 54 5/5/11 9/4/11
1912 51 5/31/12 10/18/12
1913 100 4/25/13 10/18/13
1914 78 4/19/14 10/14/14
1915 84 5/27/15 10/8/15
1916 73 5/5/16 9/28/16
1917 99 6/2/17 10/8/17
1918 81 6/2/18 10/13/18
1919 85 5/28/19 9/26/19
1920 61 5/17/20 9/30/20
1921 85 6/5/21 11/3/21
1922 91 5/14/22 9/25/22
1923 67 5/9/23 9/17/23
1924 91 5/8/24 9/29/24
1925 70 5/3/25 9/24/25
1926 84 4/25/26 9/9/26
1927 77 4/25/27 10/20/27
1928 88 5/5/28 10/9/28
1929 91 5/22/29 10/23/29
1930 86 5/23/30 10/7/30
1931 91 4/20/31 9/26/31
1932 82 5/11/32 10/5/32
1933 93 5/27/33 10/7/33
1934 101 4/20/34 10/12/34
1935 93 5/21/35 10/11/35
1936 85 5/10/36 9/26/36
For example, I would like to see the first start date as 142 (because May 22 is the 142nd day of the 365 days in a year). At this point I couldn't care less about leap years, so we'll pretend they don't exist. I want the output in a table with the "YR", "start_date", and "end_date" columns (except with the dates as absolute day numbers). For the first row, I would want "1900", "142" and "291" as the output.
I've tried to do this with an if-else statement, but it seems cumbersome to do for 365 days of the year (also I am fairly new to R and only have experience doing this in MATLAB). Any help is greatly appreciated!
Based on this answer, you can modify your data frame as follows (the dates need to be parsed from their m/d/y character form before yday() can be applied):
library(lubridate)
df$start_date <- yday(mdy(df$start_date))  # mdy() parses the "5/22/00"-style strings into dates first
df$end_date <- yday(mdy(df$end_date))
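Since the question says leap years should be ignored, one option (a sketch, assuming the BakersfieldTMAXextremes data frame and the m/d/yy date format shown above) is to drop the two-digit year and anchor every date to a fixed non-leap year before taking yday():
library(lubridate)

# Strip the two-digit year ("5/22/00" -> "5/22") and re-attach a fixed
# non-leap year so leap days never shift the count.
md_start <- sub("/\\d+$", "", BakersfieldTMAXextremes$start_date)
md_end   <- sub("/\\d+$", "", BakersfieldTMAXextremes$end_date)

BakersfieldTMAXextremes$start_date <- yday(mdy(paste0(md_start, "/2001")))
BakersfieldTMAXextremes$end_date   <- yday(mdy(paste0(md_end, "/2001")))
With this, the 1900 row in the sample above comes out as start_date 142 and end_date 291.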
I have 2 datasets with more than 100K rows each. I would like to merge them based on fuzzy string matching of one column ('movie title') as well as on the release date. I am providing a sample from both datasets below.
dataset-1
itemid userid rating time title release_date
99991 1673 835 3 1998-03-27 mirage 1995
99992 1674 840 4 1998-03-29 mamma roma 1962
99993 1675 851 3 1998-01-08 sunchaser, the 1996
99994 1676 851 2 1997-10-01 war at home, the 1996
99995 1677 854 3 1997-12-22 sweet nothing 1995
99996 1678 863 1 1998-03-07 mat' i syn 1997
99997 1679 863 3 1998-03-07 b. monkey 1998
99998 1680 863 2 1998-03-07 sliding doors 1998
99999 1681 896 3 1998-02-11 you so crazy 1994
100000 1682 916 3 1997-11-29 scream of stone (schrei aus stein) 1991
dataset - 2
itemid userid rating time title release_date
1 2844 4477 3 2013-03-09 fantã´mas - 〠l'ombre de la guillotine 1913
2 4936 8871 4 2013-05-05 the bank 1915
3 4936 11628 3 2013-07-06 the bank 1915
4 4972 16885 4 2013-08-19 the birth of a nation 1915
5 5078 11628 2 2013-08-23 the cheat 1915
6 6684 4222 3 2013-08-24 the fireman 1916
7 6689 4222 3 2013-08-24 the floorwalker 1916
8 7264 2092 4 2013-03-17 the rink 1916
9 7264 5943 3 2013-05-12 the rink 1916
10 7880 11628 4 2013-07-19 easy street 1917
I have looked at 'agrep', but it only matches one string at a time. The 'stringdist' function is good, but you need to run it in a loop, find the minimum distance, and then go on to further processing, which is very time consuming given the size of the datasets. The strings can have typos and special characters, which is why fuzzy matching is required. I have looked around and found the 'Levenshtein' and 'Jaro-Winkler' methods. The latter, I read, is good when you have typos in strings.
In this scenario, fuzzy matching alone may not provide good results; e.g., the movie title 'toy story' in one dataset could be matched to 'toy story 2' in the other, which is not right. So I need to consider the release date to make sure the matched movies are unique.
I want to know if there is a way to achieve this task without using a loop. Worst case, if I have to use a loop, how can I make it work as efficiently and as fast as possible?
I have tried the following code, but it takes an awfully long time to run.
for (i in 1:nrow(test)) {
  for (j in 1:nrow(test1)) {
    test$title.match <- ifelse(jarowinkler(test$x[i], test1$x[j]) > 0.85,
                               test$title, NA)
  }
}
test - contains 1682 unique movie names converted to lower case
test1 - contains 11451 unique movie names converted to lower case
Is there a way to avoid the for loops and make it work faster?
What about this approach to move you forward? You can adjust the degree of match from 0.85 after you see the results. You could then use dplyr to group by the matched title and summarise by subtracting release dates. Any zeros would mean the same release date.
dataset_1$title.match <- ifelse(jarowinkler(dataset_1$title, dataset_2$title) > 0.85,
                                dataset_1$title, NA)
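If the element-by-element comparison is too slow, another option (only a sketch, assuming data frames dataset_1 and dataset_2 with title and release_date columns as in the question) is to let the stringdist package compute all pairwise Jaro-Winkler distances in one vectorized call and then filter on the release date:
library(stringdist)

# One distance matrix over the unique titles (1682 x 11451 here) instead of
# nested R loops; method "jw" with p = 0.1 is the Jaro-Winkler distance.
d <- stringdistmatrix(dataset_1$title, dataset_2$title, method = "jw", p = 0.1)

best      <- apply(d, 1, which.min)            # closest dataset_2 title for each dataset_1 title
best_dist <- d[cbind(seq_len(nrow(d)), best)]

matches <- data.frame(
  title_1      = dataset_1$title,
  title_2      = dataset_2$title[best],
  dist         = best_dist,
  same_release = dataset_1$release_date == dataset_2$release_date[best]
)

# A Jaro-Winkler distance below 0.15 roughly corresponds to the 0.85
# similarity cut-off; requiring the same release date rules out cases like
# 'toy story' vs 'toy story 2'.
matches <- matches[matches$dist < 0.15 & matches$same_release, ]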
I have the data frame new1 with 20 columns of variables, one of which is new1$year. This includes 25 years with the following counts:
> table(new1$year)
1988 1989 1990 1991 1992 1993 1994 1995 1996 1997 1998 1999 2000 2001 2002 2003 2004 2005 2006 2007 2008 2009 2010 2011 2012
2770 3171 3392 2955 2906 2801 2930 2985 3181 3059 2977 2884 3039 2428 2653 2522 2558 2370 2666 3046 3155 3047 2941 2591 1580
I tried to prepare a histogram of this with
hist(new1$year, breaks=25)
but I obtain a histogram where the height of the columns differs from the numbers in table(new1$year). For example, the first column is >4000 in the histogram while it should be at most 2770; another example is 1995, where the bar should be lower relative to the surrounding years, but it is actually a little higher.
What am I doing wrong? I have tried to use numeric(new1$year) (the error says 'invalid length argument'), but with no different result.
Many thanks
Marco
Per my comment, try:
barplot(table(new1$year))
The reason hist does not work exactly as you intend has to do with specification of the breaks argument. See ?hist:
one of:
a vector giving the breakpoints between histogram cells,
a function to compute the vector of breakpoints,
a single number giving the number of cells for the histogram,
a character string naming an algorithm to compute the number of cells (see ‘Details’),
a function to compute the number of cells.
In the last three cases the number is a suggestion only.
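If you do want hist() to reproduce the counts in table(new1$year) exactly, you can pass explicit breakpoints instead of a single number so that each year falls into its own cell (a sketch, assuming the integer years 1988-2012 shown above):
# One cell per year: breakpoints at the half-integers surrounding each year
hist(new1$year, breaks = seq(1987.5, 2012.5, by = 1))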
I was wondering if you could help me devise an effortless way to code this country-year event data that I'm using.
In the example below, each row corresponds with an ongoing event (that I will eventually fold into a broader panel data set, which is why it looks bare now). So, for example, country 29 had the onset of an event in 1920, which continued (and ended) in 1921. Country 23 had the onset of the event in 1921, which lasted until 1923. Country 35 had the onset of an event that occurred in 1921 and only in 1921, et cetera.
country year
29 1920
29 1921
23 1921
23 1922
23 1923
35 1921
64 1926
135 1928
135 1929
135 1930
135 1931
135 1932
135 1933
135 1934
120 1930
70 1932
What I want to do is create "onset" and "ongoing" variables. The "ongoing" variable in this sample data frame would be easy. Basically: Data$ongoing <- 1
I'm more interested in creating the "onset" variable. It would be coded as 1 if it marks the onset of the event for the given country. Basically, I want to create a variable that looks like this, given this example data.
country year onset
29 1920 1
29 1921 0
23 1921 1
23 1922 0
23 1923 0
35 1921 1
64 1926 1
135 1928 1
135 1929 0
135 1930 0
135 1931 0
135 1932 0
135 1933 0
135 1934 0
120 1930 1
70 1932 1
If you can think of effortless ways to do this in R (ways that minimize the chances of human error when working with the data in a spreadsheet program like Excel), I'd appreciate it. I did see this related question, but this person's data set doesn't look like mine and it may require a different approach.
Thanks. Reproducible code for this example data is below.
country <- c(29,29,23,23,23,35,64,135,135,135,135,135,135,135,120,70)
year <- c(1920,1921,1921,1922,1923,1921,1926,1928,1929,1930,1931,1932,1933,1934,1930,1932)
Data=data.frame(country=country,year=year)
summary(Data)
Data
This should work, even with multiple onsets per country:
Data$onset <- with(Data, ave(year, country, FUN = function(x)
  as.integer(c(TRUE, tail(x, -1L) != head(x, -1L) + 1L))))
You could also do this:
library(data.table)
setDT(Data)[, onset := (min(country*year)/country == year) + 0L, country]
This could be very fast when you have a larger dataset.
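A dplyr alternative, in case you prefer that syntax (just a sketch; like the ave() answer above, it flags a year as an onset whenever it is not exactly one year after the country's previous observed year, so multiple onsets per country are handled):
library(dplyr)

Data <- Data %>%
  arrange(country, year) %>%
  group_by(country) %>%
  # first observed year (lag is NA) or a gap of more than one year => onset
  mutate(onset = as.integer(is.na(lag(year)) | year != lag(year) + 1)) %>%
  ungroup()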