I need to remove duplicates from a big data frame that has 100 million rows, and I am testing whether data.table can help with that. However, in the following code, unique() on the data.table does not give the same result as unique() on the data frame. Is there a possible bug in setkey in data.table?
library(data.table)
tmp <- data.frame(id=c(1000000128152, 1000000228976, 1000000235508, 1000000294933, 1000000311288, 1000000353770, 1000000441585, 1000000466482, 1000000473521,
1000000491353, 1000000497787, 1000000534948, 1000000589071, 1000000622890, 1000000658287, 1000000695865, 1000000731674, 1000000780659,
1000000818218, 1000000834389, 1000000877189, 1000000937770, 1000000937770, 1000000996135, 1000001061831, 1000001062057, 1000001065241,
1000001097542, 1000001122242, 1000001177167, 1000001194078, 1000001216323, 1000001232155, 1000001294998, 1000001361126, 1000001361126,
1000001389830, 1000001411284, 1000001415793, 1000001417557, 1000001485326, 1000001565513, 1000001624601, 1000001650282, 1000001681805,
1000001683548, 1000001683548, 1000001693445, 1000001693455, 1000001693462, 1000001693466, 1000001693490, 1000001693490, 1000001703493,
1000001703511, 1000001703518, 1000001703546, 1000001703554, 1000001703613, 1000001703644))
unique(tmp$id)
DT <- data.table(tmp)
setkey(DT, id)
DTU <- unique(DT)
DTU$id
Results from unique(tmp$id):
[1] 1000000128152 1000000228976 1000000235508 1000000294933 1000000311288 1000000353770 1000000441585 1000000466482 1000000473521 1000000491353 1000000497787 1000000534948
[13] 1000000589071 1000000622890 1000000658287 1000000695865 1000000731674 1000000780659 1000000818218 1000000834389 1000000877189 1000000937770 1000000996135 1000001061831
[25] 1000001062057 1000001065241 1000001097542 1000001122242 1000001177167 1000001194078 1000001216323 1000001232155 1000001294998 1000001361126 1000001389830 1000001411284
[37] 1000001415793 1000001417557 1000001485326 1000001565513 1000001624601 1000001650282 1000001681805 1000001683548 1000001693445 1000001693455 1000001693462 1000001693466
[49] 1000001693490 1000001703493 1000001703511 1000001703518 1000001703546 1000001703554 1000001703613 1000001703644
Result from DTU$id:
[1] 1000000128152 1000000228976 1000000235508 1000000294933 1000000311288 1000000353770 1000000441585 1000000466482 1000000473521 1000000491353 1000000497787 1000000534948
[13] 1000000589071 1000000622890 1000000658287 1000000695865 1000000731674 1000000780659 1000000818218 1000000834389 1000000877189 1000000937770 1000000996135 1000001061831
[25] 1000001062057 1000001065241 1000001097542 1000001122242 1000001177167 1000001194078 1000001216323 1000001232155 1000001294998 1000001361126 1000001389830 1000001411284
[37] 1000001415793 1000001417557 1000001485326 1000001565513 1000001624601 1000001650282 1000001681805 1000001683548 1000001693445 1000001693455 1000001693462 1000001693490
[49] 1000001703493 1000001703511 1000001703518 1000001703546 1000001703554 1000001703613 1000001703644
Comparing the two, we see that 1000001693466 was dropped from DTU by mistake. Any suggestions as to why? I suspect setkey, because when I subtracted 1000000000000 from all the numbers, the two results matched.
Edit (from Arun): The default rounding feature has been removed in the current development version of data.table, v1.9.7, and is likely to stay that way moving forward. See here for installation instructions.
This also means that you're fully responsible for understanding limitations in representing floating point numbers and dealing with them :-).
help(setkey) says (data.table version 1.9.6):
Note that columns of numeric types (i.e., double) have their last two bytes rounded off while computing order, by default, to avoid any unexpected behaviour due to limitations in representing floating point numbers precisely. Have a look at setNumericRounding to learn more.
By changing rounding to 1 byte before keying
DT <- data.table(tmp)
setNumericRounding(1) # set rounding
setkey(DT, id)
the value is no longer dropped.
However, help(setNumericRounding) says
For large numbers (integers > 2^31), we recommend using bit64::integer64 rather than setting rounding to 0.
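Following that recommendation, here is a minimal sketch (assuming the bit64 package is installed) of keying on an integer64 id instead of a double; the four-row DT64 below is a hypothetical stand-in for the real table:
library(data.table)
library(bit64)
# integer64 stores these ids exactly, so setkey/unique compare whole
# integers rather than doubles with rounded-off low bytes.
DT64 <- data.table(id = as.integer64(c("1000001693462", "1000001693466",
                                       "1000001693490", "1000001693490")))
setkey(DT64, id)
unique(DT64)$id
# all three distinct ids survive, including 1000001693466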
Related
In R, I have the following vector of numbers:
numbers <- c(0.0193738397702257, 0.0206218006695066, 0.021931558829559,
0.023301378178208, 0.024728095594751, 0.0262069239112787, 0.0277310799996657,
0.0292913948762414, 0.0308758879014822, 0.0324693108459748, 0.0340526658271053,
0.03560271425176, 0.0370915716288017, 0.0384863653635563, 0.0397490272396821,
0.0408363289939899, 0.0417002577578561, 0.0422890917131629, 0.0425479537267193,
0.0424213884467212, 0.0418571402964338, 0.0408094991140723, 0.039243951482081,
0.0371450856007627, 0.0345208537496488, 0.0314091884865658, 0.0278854381969885,
0.0240607638577763, 0.0200808932436969, 0.0161193801903312, 0.0123615428382314,
0.00920410652651576, 0.00628125319205829, 0.0038816517651031,
0.00214210795679701, 0.00103919307280354, 0.000435532895812429,
0.000154730641092234, 4.56593150728962e-05, 1.09540661898799e-05,
2.08952167815574e-06, 3.10045314287095e-07, 3.51923218134997e-08,
3.02121734299694e-09, 1.95269500257237e-10, 9.54697530552714e-12,
3.5914029230041e-13, 1.07379981978647e-14, 2.68543048763588e-16,
6.03891613157815e-18, 1.33875697089866e-19, 3.73885699170518e-21,
1.30142752487978e-22, 5.58607581840324e-24, 2.92551478380617e-25,
1.85002124085815e-26, 1.39826890505611e-27, 1.25058972437096e-28,
1.31082961467944e-29, 1.59522437605631e-30, 2.23371981458205e-31,
3.5678974253211e-32, 6.44735482309705e-33, 1.30771083084868e-33,
2.95492180915218e-34, 7.3857554006177e-35, 2.02831084124162e-35,
6.08139499028838e-36, 1.97878175996974e-36, 6.94814886769478e-37,
2.61888070029751e-37, 1.05433608968287e-37, 4.51270543356897e-38,
2.04454840598946e-38, 9.76544451781597e-39, 4.90105271869773e-39,
2.5743371658684e-39, 1.41165292292001e-39, 8.06250933233367e-40,
4.78746160076622e-40, 2.94835809615626e-40, 1.87667170875529e-40,
1.22833908072915e-40, 8.21091993733535e-41, 5.53869254991177e-41,
3.74485710867631e-41, 2.52485401054841e-41, 1.69027430542613e-41,
1.12176290106797e-41, 7.38294520887852e-42, 4.8381070000246e-42,
3.20123319815522e-42, 2.16493953538386e-42, 1.50891804884267e-42,
1.09057070511506e-42, 8.1903023226717e-43, 6.3480235351625e-43,
5.13533594742621e-43, 4.25591269645348e-43, 3.57422485839717e-43,
3.0293235331048e-43, 2.58514651313175e-43, 2.21952686649801e-43,
1.91634521841049e-43, 1.66319240529025e-43, 1.45043336371471e-43,
1.27052593975384e-43, 1.11752052211757e-43, 9.86689196888877e-44,
8.74248543892126e-44)
I use cumsum to get the cumulative sum. Due to R's numerical precision, many of the cumulative sums towards the end of the vector are now equivalent to 1 (even though technically they're not exactly 1, just very close to it).
So when I then try to recover my original numbers by using diff(cumulative), I get a lot of 0s instead of very small numbers. How can I prevent R from "rounding"?
cumulative <- cumsum(numbers)
diff(cumulative)
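Those zeros are a consequence of double precision itself: once the running total is around 1, any increment much smaller than the machine epsilon (about 2.2e-16) cannot change it. A tiny self-contained illustration:
1 + 1e-40 == 1        # TRUE: 1e-40 is far below double-precision resolution near 1
diff(c(1, 1 + 1e-40)) # 0, because 1 + 1e-40 is stored as exactly 1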
I think the Rmpfr package does what you want:
library(Rmpfr)
x <- mpfr(numbers,200) # set arbitrary precision that's greater than R default
cumulative <- cumsum(x)
diff(cumulative)
Here's the top and bottom of the output:
> diff(cumulative)
109 'mpfr' numbers of precision 200 bits
[1] 0.02062180066950659862445860426305443979799747467041015625
[2] 0.021931558829559001655429284483034280128777027130126953125
[3] 0.02330137817820800150148130569505156017839908599853515625
[4] 0.0247280955947510004688805196337852976284921169281005859375
...
[107] 1.117520522117570086014450710640040701536080790307716261438975e-43
[108] 9.866891968888769759087690539062888824928577731689952701181586e-44
[109] 8.742485438921260418707338389502002282130643811990663213422948e-44
You can adjust the precision as you like by changing the second argument to mpfr.
I have some code like the following example. If you run it,
library(hurricaneexposure)
library(hurricaneexposuredata)
data("hurr_tracks")
storms <- unique(hurr_tracks$storm_id)
storms
then you will see that "storms" is a long character vector with a "stormname-year" structure.
[1] "Alberto-1988" "Beryl-1988" "Chris-1988" "Florence-1988" "Gilbert-1988" "Keith-1988" "Allison-1989" "Chantal-1989"
[9] "Hugo-1989" "Jerry-1989" "Bertha-1990" "Marco-1990" "Ana-1991" "Bob-1991" "Fabian-1991" "Notnamed-1991"
[17] "Andrew-1992" "Danielle-1992" "Earl-1992" "Arlene-1993" "Emily-1993" "Alberto-1994" "Beryl-1994" "Gordon-1994"
[25] "Allison-1995" "Dean-1995" "Erin-1995" "Gabrielle-1995" "Jerry-1995" "Opal-1995" "Arthur-1996" "Bertha-1996"
[33] "Edouard-1996" "Fran-1996" "Josephine-1996" "Subtrop-1997" "Ana-1997" "Danny-1997" "Bonnie-1998" "Charley-1998"
[41] "Earl-1998" "Frances-1998" "Georges-1998" "Hermine-1998" "Mitch-1998" "Bret-1999" "Dennis-1999" "Floyd-1999"
[49] "Harvey-1999" "Irene-1999" "Beryl-2000" "Gordon-2000" "Helene-2000" "Leslie-2000" "Allison-2001" "Barry-2001"
My question is how to split these elements based on the same year. For example, I want to create a new variable "y1988", a list that has all storms in 1988. If I run y1988, it will output:
y1988
[1] "Alberto-1988" "Beryl-1988" "Chris-1988" "Florence-1988" "Gilbert-1988" "Keith-1988"
And so on for y1989 through y2001. I am guessing it might use gsub() and a for-loop; however, I am a rookie in R, so I really hope you can give me some suggestions.
We can use split with a grouping variable created by removing the prefix substring up to and including the - with sub.
lst <- split(storms, sub(".*-", "", storms))
lst$`1988`
#[1] "Alberto-1988" "Beryl-1988" "Chris-1988" "Florence-1988"
#[5] "Gilbert-1988" "Keith-1988"
data
storms <- c("Alberto-1988", "Beryl-1988", "Chris-1988", "Florence-1988",
"Gilbert-1988", "Keith-1988", "Allison-1989", "Chantal-1989",
"Hugo-1989", "Jerry-1989", "Bertha-1990", "Marco-1990", "Ana-1991",
"Bob-1991", "Fabian-1991", "Notnamed-1991", "Andrew-1992", "Danielle-1992",
"Earl-1992", "Arlene-1993", "Emily-1993", "Alberto-1994", "Beryl-1994",
"Gordon-1994", "Allison-1995", "Dean-1995", "Erin-1995", "Gabrielle-1995",
"Jerry-1995", "Opal-1995", "Arthur-1996", "Bertha-1996", "Edouard-1996",
"Fran-1996", "Josephine-1996", "Subtrop-1997", "Ana-1997", "Danny-1997",
"Bonnie-1998", "Charley-1998", "Earl-1998", "Frances-1998", "Georges-1998",
"Hermine-1998", "Mitch-1998", "Bret-1999", "Dennis-1999", "Floyd-1999",
"Harvey-1999", "Irene-1999", "Beryl-2000", "Gordon-2000", "Helene-2000",
"Leslie-2000", "Allison-2001", "Barry-2001")
Why don't you extract the year directly within your original data frame? The dplyr and tidyr libraries are well suited to problems like this.
I suggest the following:
library(dplyr)
library(tidyr)
hurr_tracks %>%
extract(storm_id, c("storm", "year"),"(.+)-(.+)")
An alternative way, using stringr (remember to load it first):
library(stringr)
split(storms, str_extract(storms, "[0-9]+"))
I have several vectors to combine into a named list ("my_list"). The names of the vectors are already stored in the vector ("zI").
> zI
[1] "Chemokines" "Cell_Cycle" "Regulation"
[4] "Senescence" "B_cell_Functions" "T_Cell_Functions"
[7] "Cell_Functions" "Adhesion" "Transporter_Functions"
[10] "Complement" "Pathogen_Defense" "Cytokines"
[13] "Antigen_Processing" "Leukocyte_Functions" "TNF_Superfamily"
[16] "Macrophage_Functions" "Microglial_Functions" "Interleukins"
[19] "Cytotoxicity" "NK_Cell_Functions" "TLR"
If it's a small number of vectors, I'd simply do
my_list <- setNames(list(Chemokines, Adhesion), c("Chemokines", "Adhesion"))
I'd like to find a smarter way than combining the vector names into a long string and then copying/pasting.
> toString(zI)
[1] "Chemokines, Cell_Cycle, Regulation, Senescence, B_cell_Functions, T_Cell_Functions, Cell_Functions, Adhesion, Transporter_Functions, Complement, Pathogen_Defense, Cytokines, Antigen_Processing, Leukocyte_Functions, TNF_Superfamily, Macrophage_Functions, Microglial_Functions, Interleukins, Cytotoxicity, NK_Cell_Functions, TLR"
> my_lists <- list(Chemokines, Cell_Cycle, Regulation, Senescence, B_cell_Functions, T_Cell_Functions, Cell_Functions, Adhesion, Transporter_Functions, Complement, Pathogen_Defense, Cytokines, Antigen_Processing, Leukocyte_Functions, TNF_Superfamily, Macrophage_Functions, Microglial_Functions, Interleukins, Cytotoxicity, NK_Cell_Functions, TLR)
> my_lists <- setNames(my_lists, zI)
This is probably a really fundamental question, but I've searched and read about 10 separate threads and still can't figure it out. Many thanks for any help!
We can use mget to fetch the objects named by the character strings; it returns a named list, so no separate setNames step is needed.
mget(zI)
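A minimal sketch with two made-up vectors (the contents here are hypothetical placeholders):
Chemokines <- c("CCL2", "CCL5")
Adhesion <- c("ICAM1", "VCAM1")
str(mget(c("Chemokines", "Adhesion")))  # a named list, names taken from the strings
#List of 2
# $ Chemokines: chr [1:2] "CCL2" "CCL5"
# $ Adhesion  : chr [1:2] "ICAM1" "VCAM1"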
I want to create a vector of names that act as variable names so I can then use them later on in a loop.
years <- 1950:2012
varname <- character(length(years)) # initialize the vector before the loop
for (i in seq_along(years)) {
  varname[i] <- paste("mydata", years[i], sep = "")
}
this gives:
> [1] "mydata1950" "mydata1951" "mydata1952" "mydata1953" "mydata1954" "mydata1955" "mydata1956" "mydata1957" "mydata1958"
[10] "mydata1959" "mydata1960" "mydata1961" "mydata1962" "mydata1963" "mydata1964" "mydata1965" "mydata1966" "mydata1967"
[19] "mydata1968" "mydata1969" "mydata1970" "mydata1971" "mydata1972" "mydata1973" "mydata1974" "mydata1975" "mydata1976"
[28] "mydata1977" "mydata1978" "mydata1979" "mydata1980" "mydata1981" "mydata1982" "mydata1983" "mydata1984" "mydata1985"
[37] "mydata1986" "mydata1987" "mydata1988" "mydata1989" "mydata1990" "mydata1991" "mydata1992" "mydata1993" "mydata1994"
[46] "mydata1995" "mydata1996" "mydata1997" "mydata1998" "mydata1999" "mydata2000" "mydata2001" "mydata2002" "mydata2003"
[55] "mydata2004" "mydata2005" "mydata2006" "mydata2007" "mydata2008" "mydata2009" "mydata2010" "mydata2011" "mydata2012"
All I want to do is remove the quotes and be able to call each value individually.
I want:
>[1] mydata1950 mydata1951 mydata1952 mydata1953, #etc...
stored as a variable such that
varname[1]
> mydata1950
varname[2]
> mydata1951
and so on.
I have played around with
cat(varname[i],"\n")
but this just prints the values on one line and I can't call each individual string. And
gsub("'",'',varname)
but this doesn't seem to do anything.
Suggestions? Is this possible in R? Thank you.
There are no quotes in that character vector's values. Use:
cat(varname)
.... if you want to see the unquoted values. The R print mechanism is set to use quotes as a signal to your brain that distinct values are present. You can also use:
print(varname, quote=FALSE)
If there are that many named objects in your workspace, then you desperately need to learn to use lists. There are mechanisms for "promoting" character values to names, but relying on them would be seen as a failure to learn to use the language effectively:
var <- 2
> eval(as.name('var'))
[1] 2
> eval(parse(text="var"))
[1] 2
> get('var')
[1] 2
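For completeness, a sketch of the list-based workflow that avoids promoting strings to names at all (the stored contents are hypothetical):
years <- 1950:2012
# one named list instead of 63 loose variables
mydata <- setNames(vector("list", length(years)), paste0("mydata", years))
mydata[["mydata1950"]] <- 1:3 # store each year's data under its name
mydata[["mydata1950"]]
#[1] 1 2 3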
I tried to find the two values in the following vector that are close to 10. The expected values are 10.12099196 and 10.63054170. Your input would be appreciated.
[1] 0.98799517 1.09055728 1.20383713 1.32927166 1.46857509 1.62380423 1.79743107 1.99241551 2.21226576 2.46106916 2.74346924 3.06455219 3.42958354 3.84350238 4.31005838
[16] 4.83051356 5.40199462 6.01590035 6.65715769 7.30532785 7.93823621 8.53773241 9.09570538 9.61755743 10.12099196 10.63018180 11.16783243 11.74870531 12.37719092 13.04922392
[31] 13.75661322 14.49087793 15.24414627 16.00601247 16.75709565 17.46236358 18.06882072 18.51050094 18.71908344 18.63563523 18.22123225 17.46709279 16.40246292 15.09417699 13.63404124
[46] 12.11854915 10.63054170 9.22947285 7.95056000 6.80923943 5.80717982 4.93764782 4.18947450 3.54966795 3.00499094 2.54283599 2.15165780 1.82114213 1.54222565 1.30703661
[61] 1.10879707 0.94170986 0.80084308 0.68201911 0.58171175 0.49695298 0.42525021 0.36451350 0.31299262 0.26922281 0.23197860 0.20023468 0.17313291 0.14995459 0.13009730
[76] 0.11305559 0.09840485 0.08578789 0.07490387 0.06549894 0.05735864
Another alternative could be allowing the user to control the "tolerance" that defines "closeness"; this can be done with a simple function:
close <- function(x, value, tol = NULL) {
  if (!is.null(tol)) {
    x[abs(x - value) <= tol]   # values within tol of value
  } else {
    x[order(abs(x - value))]   # all values, ordered by closeness to value
  }
}
Here x is a vector of values, value is the reference value for closeness, and tol is a numeric tolerance: if tol is NULL, the function returns all the values ordered by "closeness" to value; otherwise it returns just the values within tol of value.
> close(x, value=10, tol=.7)
[1] 9.617557 10.120992 10.630182 10.630542
> close(x, value=10)
[1] 10.12099196 9.61755743 10.63018180 10.63054170 9.22947285 9.09570538 11.16783243
[8] 8.53773241 11.74870531 7.95056000 7.93823621 12.11854915 12.37719092 7.30532785
[15] 13.04922392 6.80923943 6.65715769 13.63404124 13.75661322 6.01590035 5.80717982
[22] 14.49087793 5.40199462 4.93764782 15.09417699 4.83051356 15.24414627 4.31005838
[29] 4.18947450 16.00601247 3.84350238 16.40246292 3.54966795 3.42958354 16.75709565
[36] 3.06455219 3.00499094 2.74346924 2.54283599 17.46236358 17.46709279 2.46106916
[43] 2.21226576 2.15165780 1.99241551 18.06882072 1.82114213 1.79743107 18.22123225
[50] 1.62380423 1.54222565 18.51050094 1.46857509 18.63563523 1.32927166 1.30703661
[57] 18.71908344 1.20383713 1.10879707 1.09055728 0.98799517 0.94170986 0.80084308
[64] 0.68201911 0.58171175 0.49695298 0.42525021 0.36451350 0.31299262 0.26922281
[71] 0.23197860 0.20023468 0.17313291 0.14995459 0.13009730 0.11305559 0.09840485
[78] 0.08578789 0.07490387 0.06549894 0.05735864
In the first example I defined "closeness" as a difference of at most 0.7 between value and each element of x. In the second example the function close returns a vector in which the first values are the closest to value and the last values are the farthest from it.
Since my solution does not provide an easy (practical) way to choose tol, as @Arun pointed out, one way to find the closest values is to set tol=NULL and ask for the exact number of close values, as in:
> close(x, value=10)[1:3]
[1] 10.120992 9.617557 10.630182
This shows the three values in x closest to 10.
I can't think of a way without using sort. However, you can speed it up by using partial sort.
x[abs(x-10) %in% sort(abs(x-10), partial=1:2)[1:2]]
# [1] 9.617557 10.120992
In case the same values are present more than once, you'll get all of them here. So, you can either wrap this with unique or you can use match instead as follows:
x[match(sort(abs(x-10), partial=1:2)[1:2], abs(x-10))]
# [1] 10.120992 9.617557
dput output:
dput(x)
c(0.98799517, 1.09055728, 1.20383713, 1.32927166, 1.46857509,
1.62380423, 1.79743107, 1.99241551, 2.21226576, 2.46106916, 2.74346924,
3.06455219, 3.42958354, 3.84350238, 4.31005838, 4.83051356, 5.40199462,
6.01590035, 6.65715769, 7.30532785, 7.93823621, 8.53773241, 9.09570538,
9.61755743, 10.12099196, 10.6301818, 11.16783243, 11.74870531,
12.37719092, 13.04922392, 13.75661322, 14.49087793, 15.24414627,
16.00601247, 16.75709565, 17.46236358, 18.06882072, 18.51050094,
18.71908344, 18.63563523, 18.22123225, 17.46709279, 16.40246292,
15.09417699, 13.63404124, 12.11854915, 10.6305417, 9.22947285,
7.95056, 6.80923943, 5.80717982, 4.93764782, 4.1894745, 3.54966795,
3.00499094, 2.54283599, 2.1516578, 1.82114213, 1.54222565, 1.30703661,
1.10879707, 0.94170986, 0.80084308, 0.68201911, 0.58171175, 0.49695298,
0.42525021, 0.3645135, 0.31299262, 0.26922281, 0.2319786, 0.20023468,
0.17313291, 0.14995459, 0.1300973, 0.11305559, 0.09840485, 0.08578789,
0.07490387, 0.06549894, 0.05735864)
I'm not sure your question is clear, so here's another approach. To find the value closest to your first desired value, 10.12099196, subtract it from the vector, take the absolute value, and then find the index of the smallest element. Explicitly:
delx <- abs( 10.12099196 - x)
min.index <- which.min(delx) #returns index of first minimum if there are duplicates
x[min.index] #gets you the value itself
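If you have several target values, the same idea extends with vapply (a small sketch using the two targets from the question):
targets <- c(10.12099196, 10.63054170)
# for each target, return the element of x with the smallest absolute distance
vapply(targets, function(v) x[which.min(abs(x - v))], numeric(1))
#[1] 10.12099 10.63054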
Apologies if this was not the intent of your question.