dummy_cols Error: vector memory exhausted (limit reached?) - r

I am attempting to create dummy variables based on a factor variable with more than 200 factor levels. The data has more than 15 million observations. Using the "fastDummies" package, I am using the "dummy_cols" command to convert the factor variable to dummies, and remove the first.
I have read numerous posts on the issue. Several suggest subsetting the data, which I cannot do. This analysis is for a school assignment that requires that I use all included data.
I am using a 16GB RAM Macbook Pro with the 64-bit version of RStudio. The instructions in the post below dictate how to increase the max RAM available to R. However, the instructions seem to imply that I am already at maximum capacity, or that it may be unsafe for my machine to attempt to raise the memory restriction on R.
R on MacOS Error: vector memory exhausted (limit reached?)
I'm not sure how to go about posting 15 million rows of data. The following code shows the unique factor levels for the variable in question:
unique(housing$city)
[1] 40 80 160 200 220 240 280 320 440 450 460 520 560 600 640
[16] 680 720 760 840 860 870 880 920 960 1000 1040 1080 1120 1121 1122
[31] 1150 1160 1240 1280 1320 1360 1400 1440 1520 1560 1600 1602 1620 1640 1680
[46] 1720 1740 1760 1840 1880 1920 1950 1960 2000 2020 2040 2080 2120 2160 2240
[61] 2290 2310 2320 2360 2400 2560 2580 2640 2670 2680 2700 2760 2840 2900 2920
[76] 3000 3060 3080 3120 3160 3180 3200 3240 3280 3290 3360 3480 3520 3560 3590
[91] 3600 3620 3660 3680 3710 3720 3760 3800 3810 3840 3880 3920 3980 4000 4040
[106] 4120 4280 4320 4360 4400 4420 4480 4520 4600 4680 4720 4800 4880 4890 4900
[121] 4920 5000 5080 5120 5160 5170 5200 5240 5280 5360 5400 5560 5600 5601 5602
[136] 5603 5605 5790 5800 5880 5910 5920 5960 6080 6120 6160 6200 6280 6440 6520
[151] 6560 6600 6640 6680 6690 6720 6740 6760 6780 6800 6840 6880 6920 6960 6980
[166] 7040 7080 7120 7160 7240 7320 7360 7362 7400 7470 7480 7500 7510 7520 7560
[181] 7600 7610 7620 7680 7800 7840 7880 7920 8000 8040 8050 8120 8160 8200 8280
[196] 8320 8400 8480 8520 8560 8600 8640 8680 8730 8760 8780 8800 8840 8880 8920
[211] 8940 8960 9040 9080 9140 9160 9200 9240 9260 9280 9320
I used the following commands to create the dummy variables, based on the fastDummies package:
library(fastDummies)
housing <- dummy_cols(housing, select_columns = "city", remove_first_dummy = TRUE)
I get the following response:
Error: vector memory exhausted (limit reached?)
I am, again, trying to create 220 dummies based on the 221 levels (excluding the first to avoid issues of perfect collinearity in analyses).
Any help is most welcome. If I am missing something about the preceding suggestions, my apologies; none of them involved the exact issue I am experiencing (in the context of creating dummies) and I am not very proficient in use of the command line in Mac OS.

An update: I followed the method from the post "R on MacOS Error: vector memory exhausted (limit reached?)", using Terminal to remove the default memory cap allotted to R, and was able to perform the operations I needed (although they took an extremely long time).
However, I am still concerned that this may be problematic for computing power. Could removing these default limits on memory used by R damage my computer? Activity Monitor says that my rsession is using almost 48 GB of RAM when I only have physical memory on board of 16 GB.
I understand that I may be walking the line between a coding and a software question here, but the two in this case are related.
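For a sense of scale, here is a rough sketch of the arithmetic. It assumes each dummy is stored as a 4-byte integer (an assumption; 8-byte doubles would double the figure) and ignores any temporary copies made while the columns are built. The second half uses sparse.model.matrix() from the Matrix package on a toy data frame (`toy` is hypothetical, not the housing data) to show a representation that stores only the non-zero entries. This is a swapped-in technique, not what dummy_cols does, and whether it helps depends on whether the downstream modelling function accepts sparse input.

```r
library(Matrix)

# Dense storage: one cell per row and dummy column.
n_rows         <- 15e6  # "more than 15 million observations"
n_dummies      <- 220   # 221 levels minus the dropped first level
bytes_per_cell <- 4     # assumption: dummies stored as integers
dense_gb <- n_rows * n_dummies * bytes_per_cell / 1e9
dense_gb
#> [1] 13.2

# Sparse alternative on a toy factor (hypothetical data).
toy <- data.frame(city = factor(c(40, 80, 160, 80, 40)))
X <- sparse.model.matrix(~ city, data = toy)
dim(X)  # 5 rows; intercept plus 2 dummies (first level dropped)
#> [1] 5 3
```

At roughly 13 GB for the dummy block alone, a 16 GB machine would probably have to lean on swap once the original data and intermediate copies are added, which would be consistent with the very long run time.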

Related

Regex for a word followed by multiple numbers in a sequence

I am trying to create a general regular expression that returns a boolean if it sees one word followed by more than one set of numbers.
It should report TRUE for the following test cases:
"AB 40 256 556 1144 1296 1496 1722 1847 1915 1979 2018 2056 2106 2240 2294 2394 2539 2587 2660"
"SB 466 848 929 1339 1554 1761 1807 1828 1852 1875 1899 1922 1940 1968 2007 2046 2074 2075 2158"
"Assembly 772 1604 1932 2187 2543 2759 2777"
"Senate 241 1110 1342 1822 1865 1957"
And FALSE for the following cases:
"ACR 105"
"SJR 29"
"AB 2359 AB 2456 and AB 2823"
"CDFA Budget for Pierce's Disease"
"PERS, STRS, Regents"
If you can provide two answers: one regular expression looking for the letters and the numbers and another answer looking for multiple numbers back to back, I would greatly appreciate it.
Thank you so much for your help!
Try this.
grepl('\\D+\\d+\\s\\d+', x)
# [1] TRUE TRUE TRUE TRUE FALSE FALSE FALSE FALSE FALSE
Explanation
\D+ one or more non-digits
\d+ one or more digits
\s one whitespace character
\d+ one or more digits
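A small sketch of the pattern on a subset of the test cases. The `strict` pattern is a hypothetical variant, not part of the answer: it assumes the whole string must be a single alphabetic word followed by at least two whitespace-separated numbers and nothing else.

```r
x <- c(
  "AB 40 256 556",
  "Senate 241 1110 1342",
  "ACR 105",
  "AB 2359 AB 2456 and AB 2823",
  "PERS, STRS, Regents"
)

# Pattern from the answer: non-digits, a number, one whitespace, another number.
loose <- grepl("\\D+\\d+\\s\\d+", x)

# Hypothetical stricter variant: one leading word, then two or more numbers, anchored.
strict <- grepl("^[A-Za-z]+(\\s\\d+){2,}$", x)

loose
#> [1]  TRUE  TRUE FALSE FALSE FALSE
strict
#> [1]  TRUE  TRUE FALSE FALSE FALSE
```

Both agree on these inputs. The anchored version would additionally reject strings with trailing text after the numbers, which the looser pattern accepts.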

hts method for creating hierarchical time series

I am trying to convert a time series into a hierarchical one using the hts package and my structure is as follows:
My data is already stored as a ts object in R with a weekly frequency and all columns are labeled:
Then I try to make the hts object but I get this error message, probably because I do not fully understand how to split the columns in the argument "characters".
x <- hts(abc, c(1,1,2,2))
Since argument characters are not specified, the default labelling system is used.
Error: Argument nodes must be a list.
Thanks in advance
First you need to use the argument characters. It is not the second argument of hts(). Second, you need to specify characters according to the names of your columns. In this case, the first character is the first level of disaggregation (A, B, C, D, E), while the next two characters specify the next level of disaggregation. So your argument should be characters = c(1,2).
Here is an example with synthetic data, but with the same structure as yours.
library(hts)
abc <- matrix(sample(1:100, 32*140, replace=TRUE), ncol=32)
colnames(abc) <- c(
paste0("A0",1:5),
paste0("B0",1:9),"B10",
paste0("C0",1:8),
paste0("D0",1:5),
paste0("E0",1:4)
)
abc <- ts(abc, start=2019, frequency=365.25/7)
x <- hts(abc, characters = c(1,2))
x
#> Hierarchical Time Series
#> 3 Levels
#> Number of nodes at each level: 1 5 32
#> Total number of series: 38
#> Number of observations per series: 140
#> Top level series:
#> Time Series:
#> Start = 2019
#> End = 2021.66392881588
#> Frequency = 52.1785714285714
#> [1] 1735 1645 1472 1638 1594 1722 1525 1761 1500 1746 1331 1567 1853 1652 1587
#> [16] 1540 1453 1989 1629 1587 1596 1474 1320 1599 1762 1419 1931 1447 2102 1608
#> [31] 1439 1909 1331 1742 1428 1677 1534 1657 1741 1612 1574 1954 1542 2067 1512
#> [46] 1850 1650 1666 1321 1332 1924 1786 1496 1695 1363 1437 1740 1448 1260 1371
#> [61] 1661 1726 1786 1641 1463 1616 1641 1895 1503 1430 1972 1705 1722 1447 1515
#> [76] 1636 1544 1727 1960 1647 1682 1569 1616 1628 1706 1837 1738 1659 1574 1716
#> [91] 1409 1428 1411 1708 1606 1501 1413 1707 1552 1567 1693 1748 2034 1557 1402
#> [106] 1649 1637 1653 1857 1401 1519 1600 1844 1585 1796 1612 1456 1626 1390 1368
#> [121] 1492 1765 1644 1773 1302 2027 1810 1652 1819 1628 1574 1655 1650 1817 1605
#> [136] 1422 1793 1999 1489 1667
Created on 2021-10-18 by the reprex package (v2.0.1)
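To see what characters = c(1,2) does to the labels, the split can be imitated with base R's substr(). This is only an illustration of the labelling idea for the column names used above, not a description of how hts works internally.

```r
# Same 32 column names as in the synthetic example.
nm <- c(paste0("A0", 1:5), paste0("B0", 1:9), "B10",
        paste0("C0", 1:8), paste0("D0", 1:5), paste0("E0", 1:4))

level1 <- substr(nm, 1, 1)  # first character: the group (A to E)
level2 <- substr(nm, 2, 3)  # next two characters: the series within the group

length(unique(level1))      # number of groups at the middle level
#> [1] 5
as.vector(table(level1))    # series per group, in the order A, B, C, D, E
#> [1]  5 10  8  5  4
length(nm)                  # number of bottom-level series
#> [1] 32
```

The counts 5 and 32 line up with "Number of nodes at each level: 1 5 32" in the printed hts object. Every label has to be exactly 1 + 2 = 3 characters long for this split to work, which is why the names are zero-padded (A01 rather than A1).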

Find groups of thousands which sum to a given number, in lexical order

A large number can be comma formatted to read more easily into groups of three. E.g. 1050 = 1,050 and 10200 = 10,200.
The sum of each of these groups of three would be:
1050=1,050 gives: 1+50=51
10200=10,200 gives: 10+200=210
I need to search for matches in the sum of the groups of threes.
Namely, if I am searching for 1234, then I am looking for numbers whose sum of threes = 1234.
The smallest match is 235,999 since 235+999=1234. No other integer less than 235,999 gives a sum of threes equal to 1234.
The next smallest match is 236,998 since 236+998=1234.
One can add 999 each time, but this fails after reaching 999 since an extra digit of 1 is added to the number due to overflow in the 999.
More generally, I am asking for the solutions (smallest to highest) to:
a+b+c+d… = x
where a, b, c, d, … are an arbitrary number of integers between 0 and 999 and x is a fixed integer.
Note there are infinite solutions to this for any positive integer x.
How would one get the solutions to this beginning with the smallest number solutions (for y number of solutions where y can be an arbitrarily large number)?
Is there a way to do this without brute force looping one by one? I'm dealing with potentially very large numbers, which could take years to loop through in a straight loop. Ideally, one should do this without failed attempts.
The problem is easier to think about if instead of groups of 3 digits, you just consider 1 digit at a time.
An algorithm:
1. Start by filling group 0 (the rightmost group) with x.
2. Loop, printing the next solution on each pass:
   a. "Normalize" the groups by moving everything that is too large from the right to the left, leaving only the maximum value at the right.
   b. Output the solution.
   c. Add 1 to the penultimate group. This can carry to the left if a group gets too large (e.g. 999+1 is too large).
   d. Check that the result did not get too large (a[0] must be able to absorb what was added). If it did, set the group to zero and continue incrementing the groups to its left.
   e. Recalculate the last group so that it absorbs the surplus (which can be positive or negative).
Some Python code for illustration:
x = 1234
grouping = 3
max_iterations = 200
max_in_group = 10**grouping - 1
a = [x]  # a[0] is the rightmost (least significant) group
while max_iterations > 0:
    # step 1: while a[0] is too large: redistribute to the left
    i = 0
    while a[i] > max_in_group:
        if i == len(a) - 1:
            a.append(0)
        a[i + 1] += a[i] - max_in_group
        a[i] = max_in_group
        i += 1
    num = sum(10**(grouping*i) * n for i, n in enumerate(a))
    print(f"{num} {num:,}")
    # print("".join([str(t) for t in a[::-1]]), ",".join([str(t) for t in a[::-1]]))
    # step 2: add one to the penultimate group, while group already full: set to 0 and increment the
    #   group left of it;
    #   while the surplus is too large (because a[0] is too small) repeat the incrementing
    i0 = 1
    surplus = 0
    while True:  # needs to be executed at least once, and repeated if the surplus became too large
        i = i0
        while True:  # increment a[i] by 1, which can carry to the left
            if i == len(a):
                a.append(1)
                surplus += 1
                break
            else:
                if a[i] == max_in_group:
                    a[i] = 0
                    surplus -= max_in_group
                    i += 1
                else:
                    a[i] += 1
                    surplus += 1
                    break
        if a[0] >= surplus:
            break
        else:
            surplus -= a[i0]
            a[i0] = 0
            i0 += 1
    # step 3: a[0] should absorb the surplus created in step 2, although a[0] can get out of bounds
    a[0] -= surplus
    surplus = 0
    max_iterations -= 1
Abbreviated output:
235,999 236,998 ... 998,236 999,235 ... 1,234,999 1,235,998 ... 1,998,235 1,999,234 2,233,999 2,234,998 ...
Output for grouping=3 and x=3456:
459,999,999,999 460,998,999,999 460,999,998,999 460,999,999,998 461,997,999,999
461,998,998,999 461,998,999,998 461,999,997,999 461,999,998,998 461,999,999,997
462,996,999,999 ...
Output for grouping=1 and x=16:
79 88 97 169 178 187 196 259 268 277 286 295 349 358 367 376 385 394 439 448 457 466
475 484 493 529 538 547 556 565 574 583 592 619 628 637 646 655 664 673 682 691 709
718 727 736 745 754 763 772 781 790 808 817 826 835 844 853 862 871 880 907 916 925
934 943 952 961 970 1069 1078 1087 1096 1159 1168 1177 1186 1195 1249 1258 1267 1276
1285 1294 1339 1348 1357 1366 1375 1384 1393 1429 1438 1447 1456 1465 1474 1483 1492
1519 1528 1537 1546 1555 1564 1573 1582 1591 1609 1618 1627 1636 1645 1654 1663 1672
1681 1690 1708 1717 1726 1735 1744 1753 1762 1771 1780 1807 1816 1825 1834 1843 1852
1861 1870 1906 1915 1924 1933 1942 1951 1960 2059 2068 2077 2086 2095 2149 2158 2167
2176 2185 2194 2239 2248 2257 2266 2275 2284 2293 2329 2338 2347 2356 2365 2374 2383
2392 2419 2428 2437 2446 2455 2464 2473 2482 2491 2509 2518 2527 2536 2545 2554 2563
2572 2581 2590 2608 2617 2626 2635 2644 2653 2662 2671 2680 2707 2716 2725 2734 ...
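A brute-force cross-check of the listed output, sketched in R. The helper names group_sum() and first_matches() are hypothetical. This is exactly the one-by-one loop the question wants to avoid, so it is only useful for verifying small cases.

```r
# Sum of the digit groups of n, taken `grouping` digits at a time from the right.
group_sum <- function(n, grouping = 3) {
  base <- 10^grouping
  total <- 0
  while (n > 0) {
    total <- total + n %% base
    n <- n %/% base
  }
  total
}

# First `count` positive integers whose group sum equals x (plain linear search).
first_matches <- function(x, count, grouping = 3) {
  found <- numeric(0)
  n <- 0
  while (length(found) < count) {
    n <- n + 1
    if (group_sum(n, grouping) == x) found <- c(found, n)
  }
  found
}

group_sum(1050)     # 1 + 50
#> [1] 51
group_sum(235999)   # 235 + 999
#> [1] 1234
first_matches(16, 4, grouping = 1)
#> [1]  79  88  97 169
```

The first four matches for grouping=1 and x=16 agree with the start of the list above.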

How to call a variable in loops of R? (create arrays as dictionary)

I'd like to define a series of variables in a for loop, creating named vectors that work like dictionaries (converting tops into d1 as shown below).
First, I assign values to them (d1 to d11);
then I try to set the names of these variables.
How should I refer to a specific variable inside the names() function so that it works like "names(d1) <- ..."?
for (i in 1:11)
{
assign(paste("d",i,sep=""),tops[,2*i])
names(eval(parse(text=paste("d",i,sep=""))))<-tops[,2*i-1]
}
> tops[,c(1,2)]
V1 V2 V3 V4 V5 V6
1 shift 2136 shift 2211 shift 2324
2 bed 1463 k 1551 plant 1664
3 run 1338 bed 1527 run 1466
4 plant 1309 run 1504 k 1456
5 k 1294 hr 1484 bed 1390
6 hr 1285 clean 1464 hr 1366
7 check 1255 plant 1386 clean 1359
8 clean 1203 check 1261 s 1254
9 s 1052 s 1205 check 1048
10 unload 1024 start 1115 end 1028
11 chang 1023 fine 1113 fine 1020
12 fine 960 chang 1104 start 1006
13 end 924 end 1050 chang 977
14 start 905 stop 974 stop 950
15 pellet 878 pellet 915 pellet 897
16 work 866 work 907 remov 874
17 due 856 screen 900 sinter 862
18 stop 853 bwr 888 side 841
19 complet 772 side 888 due 809
20 remov 750 due 861 conveyor 792
21 requir 726 complet 841 work 777
22 sinter 711 sinter 834 north 771
23 south 710 conveyor 775 south 760
24 side 688 north 768 west 738
25 issu 682 remov 764 belt 737
26 t 675 ok 759 carri 735
27 belt 672 t 753 screen 727
28 carri 668 requir 750 stock 725
29 strand 649 unload 749 unload 719
30 conveyor 646 chute 747 chute 688
> d1
shift bed run plant k hr check clean s
2136 1463 1338 1309 1294 1285 1255 1203 1052
unload chang fine end start pellet work due stop
1024 1023 960 924 905 878 866 856 853
complet remov requir sinter south side issu t belt
772 750 726 711 710 688 682 675 672
carri strand conveyor
668 649 646
> length(d1)
[1] 30
I hope I have made this clear. If not, please feel free to ask me.
As David mentioned, don't assign 11 different variables; create a list with 11 elements. This will simplify your code considerably.
d <- lapply(1:11, function(i) setNames(tops[, 2 * i], tops[, 2 * i - 1]))
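A runnable sketch of the list approach on a toy stand-in for tops (the two-pair data frame below is hypothetical). It assumes the odd-numbered columns hold the words as character strings and the even-numbered columns hold the counts.

```r
# Toy stand-in: V1/V3 are words, V2/V4 are the matching counts.
tops <- data.frame(V1 = c("shift", "bed"), V2 = c(2136, 1463),
                   V3 = c("shift", "k"),   V4 = c(2211, 1551),
                   stringsAsFactors = FALSE)

# One named vector per (word, count) column pair, collected in a list.
d <- lapply(1:2, function(i) setNames(tops[, 2 * i], tops[, 2 * i - 1]))

names(d[[1]])
#> [1] "shift" "bed"
d[[2]][["k"]]   # dictionary-style lookup by word
#> [1] 1551
```

With the list, d[[1]] plays the role of d1, d[[2]] of d2, and so on, so no assign() or eval(parse()) is needed.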

How to merge bins in R

So, I am trying to merge bins of a histogram whenever the number of observations in a bin is less than 6.
library(fitdistrplus)
mydata <-read.csv("Book2.csv",stringsAsFactors=FALSE)
QF3<-as.numeric(mydata[,1])
histrv<-hist(QF3,breaks="FD")
binvec<-data.frame(diff(histrv$breaks))
binbreak=histrv$breaks
freq<-histrv$count
datmean=as.numeric(mean(QF3))
datsigma=as.numeric(sd(QF3))
templist<-as.numeric()#empty list
for (i in 1:nrow(binvec)){
templist[i]=pnorm(binbreak[i+1],datmean,datsigma)-pnorm(binbreak[i],datmean,datsigma)
}
pi<-data.frame(templist)
chisqvec<-(freq-length(QF3)*pi)^2/(length(QF3)*pi)
xstat=sum(chisqvec)
The above code produces a histogram with five bins that contain fewer than 6 observations: 6000-7000, 7000-8000, 8000-9000, 9000-10000, and 10000-11000. These five bins contain 2, 5, 2, 2, and 1 observations respectively. I would like to merge the bins so that they have more than 5 observations.
In other words, I would like to have the two bins 6000-8000 and 8000-11000 so that they can contain 7 observations and 5 observations.
Does anyone have any clue on how to approach this problem?
QF3 looks like the following:
> QF3
[1] 2016 1425 2000 785 823 2484 1870 770 1220 3454 1056 2745 2830
[14] 950 601 1245 2663 1500 1717 1070 1704 2517 1090 3310 3389 2200
[27] 882 2113 600 1900 4417 745 530 1630 1600 4530 948 2764 2202
[40] 1052 2685 1120 1275 2300 1590 1935 3957 4283 3215 5684 4092 7548
[53] 4547 3510 3063 5549 6460 5204 4626 4965 5023 8111 5525 4804 5994
[66] 8471 4767 7142 3420 4061 5102 9135 3861 5372 7274 5054 7318 3791
[79] 4901 3549 4758 4859 10190 5609 7624 5841 4908 4974 6691 5713 3235
[92] 4464 2656 4399 9581 3993 4061
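One possible approach, as a sketch: walk the bins from left to right, accumulate counts until a threshold is reached, and fold any leftover tail into the last completed bin. The function merge_small_bins() and its min_count argument are hypothetical names, and the greedy left-to-right rule is an assumption about what "merge" should mean here.

```r
merge_small_bins <- function(breaks, counts, min_count) {
  new_breaks <- breaks[1]
  new_counts <- numeric(0)
  running <- 0
  for (i in seq_along(counts)) {
    running <- running + counts[i]
    if (running >= min_count) {
      # Close the current merged bin at the right edge of bin i.
      new_breaks <- c(new_breaks, breaks[i + 1])
      new_counts <- c(new_counts, running)
      running <- 0
    }
  }
  if (running > 0) {
    last <- breaks[length(breaks)]
    if (length(new_counts) == 0) {
      # Nothing reached the threshold: everything becomes one bin.
      new_breaks <- c(new_breaks, last)
      new_counts <- running
    } else {
      # Leftover tail: extend the last completed bin to the final break.
      new_breaks[length(new_breaks)] <- last
      new_counts[length(new_counts)] <- new_counts[length(new_counts)] + running
    }
  }
  list(breaks = new_breaks, counts = new_counts)
}

# The five sparse bins described in the question.
breaks <- seq(6000, 11000, by = 1000)
counts <- c(2, 5, 2, 2, 1)
merge_small_bins(breaks, counts, min_count = 5)$counts
#> [1] 7 5
merge_small_bins(breaks, counts, min_count = 6)$counts
#> [1] 12
```

With min_count = 5 this reproduces the 6000-8000 and 8000-11000 bins asked for. With a strict minimum of 6 the last three bins only reach 5, so they are folded into the previous bin. In the full problem, histrv$breaks and histrv$counts would take the place of the toy vectors, and the merged breaks could be passed back to hist(QF3, breaks = ...).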
