Hello,
I created the dataframe below, based on the example in the sunburstR documentation.
Column Count
1: ACTIVE 68764
2: INACTIVE 73599
3: ACTIVE-RESIDENT 68279
4: ACTIVE-NONRESIDENT 485
5: INACTIVE-RESIDENT 63378
6: INACTIVE-NONRESIDENT 10221
7: ACTIVE-RESIDENT-LATIN 55
8: ACTIVE-RESIDENT-CYRLIC 68224
9: ACTIVE-NONRESIDENT-LATIN 465
10: ACTIVE-NONRESIDENT-CYRLIC 20
11: INACTIVE-RESIDENT-LATIN 114
12: INACTIVE-RESIDENT-CYRLIC 63264
13: INACTIVE-NONRESIDENT-LATIN 7915
14: INACTIVE-NONRESIDENT-CYRLIC 2306
The first column is character, the second is integer.
However, when I try to plot it, I get nothing.
sunburst(sunburst_data)
Any hints on what's wrong with the structure of my dataframe?
Include only the leaf nodes in your data frame...
df <- read.table(header = TRUE, text = '
Column Count
ACTIVE-RESIDENT-LATIN 55
ACTIVE-RESIDENT-CYRLIC 68224
ACTIVE-NONRESIDENT-LATIN 465
ACTIVE-NONRESIDENT-CYRLIC 20
INACTIVE-RESIDENT-LATIN 114
INACTIVE-RESIDENT-CYRLIC 63264
INACTIVE-NONRESIDENT-LATIN 7915
INACTIVE-NONRESIDENT-CYRLIC 2306
')
library(sunburstR)
sunburst(df)
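If you would rather keep the full table from the question and derive the leaf rows programmatically, here is one possible sketch (reusing the sunburst_data name from the question, with its Column and Count columns): drop every row whose label is extended by a deeper label, then plot what remains.
library(sunburstR)
# keep only rows whose label is not a prefix of any deeper label (i.e. the leaves)
is_leaf <- !sapply(sunburst_data$Column, function(x)
  any(startsWith(sunburst_data$Column, paste0(x, "-"))))
sunburst(sunburst_data[is_leaf, ])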
I want to subset a dataframe with the .id values specified below, but it gives me this warning:
Warning in .id == c(3, 5:12, 14, 20:64, 66:72, 75, 78:79, 81:111, 113:136, :
longer object length is not a multiple of shorter object length
when using this code:
newdatarev = subset(newdata, .id == c(3,5:12,14,20:64,66:72,75,78:79,81:111,113:136,138:149,151:160,
162:183,185:225,227:233,235:247,249,251:264,266:328,330:364,366:383,
385:411,413:471,473:490,492:580,582:598,600:603,605:606,608:619,621:646,
648:686,688:718,720:746,748,750:753,755:762,764:861,863:875,877:894,
897:911,913:914,916:926,928:941))
For reference, here is a small bit of newdata:
> newdata
.id V1 V2
1: 1 -2.870109 8273.632
2: 1 4.829891 8273.632
3: 1 21.329891 8279.132
4: 1 25.729891 8281.332
5: 1 32.329891 8285.732
---
17937: 941 1834.113417 1411.605
17938: 941 1818.713417 1392.905
17939: 941 1814.313417 1386.305
17940: 941 1814.313417 1364.305
17941: 941 1828.613417 1224.605
I have a feeling it has to do with how .id is structured, and that my code interferes with how the rows are matched against the overlapping .id values. It does give me a result, but it is a very strange collection of data:
> newdatarev
.id V1 V2
1: 55 158.8030 2045.753
2: 100 227.7387 8250.454
3: 153 356.8675 1383.835
4: 205 483.6464 3946.844
5: 299 635.8744 8387.862
6: 347 722.9303 5147.715
7: 393 850.1742 2115.559
8: 439 857.9288 8243.071
9: 482 926.5706 1608.928
10: 532 1107.8380 2616.635
11: 632 1234.6482 4957.055
12: 633 1201.8700 3252.570
13: 683 1315.2215 2068.050
14: 684 1325.5905 6253.692
15: 734 1414.3443 2267.337
16: 784 1551.0153 5184.641
17: 831 1634.2056 7159.362
18: 880 1724.5570 5726.908
19: 933 1879.6398 3465.536
Thank you in advance!
The == operator compares two vectors element by element, recycling the shorter one, which is why you get the length warning and an odd subset. What you want is to test whether each .id value is contained in a set of values, which is what the %in% infix operator does:
newdatarev <- subset(newdata, .id %in% c(3,5:12,14,20:64,66:72,75,78:79,81:111,113:136,138:149,151:160,
162:183,185:225,227:233,235:247,249,251:264,266:328,330:364,366:383,
385:411,413:471,473:490,492:580,582:598,600:603,605:606,608:619,621:646,
648:686,688:718,720:746,748,750:753,755:762,764:861,863:875,877:894,
897:911,913:914,916:926,928:941))
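As a tiny illustration with a hypothetical vector (not from the question): == recycles the shorter vector and compares position by position, which is what produces the length warning, while %in% tests membership for every element.
x <- c(1, 2, 3, 4, 5)
x == c(2, 4)    # recycled element-wise comparison: FALSE FALSE FALSE TRUE FALSE, plus a length warning
x %in% c(2, 4)  # membership test: FALSE TRUE FALSE TRUE FALSE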
A VERY simplified example of my dataset:
HUC8 YEAR RO_MM
1: 10010001 1961 78.2
2: 10010001 1962 84.0
3: 10010001 1963 70.2
4: 10010001 1964 130.5
5: 10010001 1965 54.3
I found this code online which sort of, but not quite, does what I want:
library(data.table)  # fread() and rbindlist() come from data.table
#create a list of the files from your target directory
file_list <- list.files(path="~/Desktop/Rprojects")
#initiate a blank data frame; each iteration of the loop will append the data from the given file to this variable
allHUCS <- data.frame()
#I want to read each .csv from a folder named "Rprojects" on my desktop into one huge dataframe for further use.
for (i in 1:length(file_list)){
  temp_data <- fread(file_list[i], stringsAsFactors = F)
  allHUCS <- rbindlist(list(allHUCS, temp_data), use.names = T)
}
Question: I have read that one should not iteratively bind rows inside a loop for a large dataset:
"You should never ever ever iteratively rbind within a loop: performance might be okay in the beginning, but with each call to rbind it makes a complete copy of the data, so with each pass the total data to copy increases. It scales horribly. Consider do.call(rbind.data.frame, file_list)." – #r2evans
I know this may seem simple but I'm unclear about how to use his directive. Would I write this for the last line?
allHUCS <- do.call(rbind.data.frame(allHUCS, temp_data), use.names = T)
Or something else? In my actual data, each .csv has 2099 objects with 3 variables (but I only care about the last two). The total dataframe should contain 47,000,000+ objects of 2 variables. When I ran the original code I got these errors:
Error in rbindlist(list(allHUCS, temp_data), use.names = T) :
  Item 2 has 2 columns, inconsistent with item 1 which has 3 columns. To fill missing columns use fill=TRUE.
In addition: Warning messages:
1: In fread(file_list[i], stringsAsFactors = F) :
  Detected 1 column names but the data has 2 columns (i.e. invalid file). Added 1 extra default column name for the first column which is guessed to be row names or an index. Use setnames() afterwards if this guess is not correct, or fix the file write command that created the file to create a valid file.
2: In fread(file_list[i], stringsAsFactors = F) :
  Stopped early on line 20. Expected 2 fields but found 3. Consider fill=TRUE and comment.char=. First discarded non-empty line: <<# mv *.csv .. ; >>
Except for the setnames() suggestion, I don't understand what I'm being told. I know it says it stopped early, but I don't even know how to see the entire dataset or to tell where it stopped.
I'm now reading that rbindlist and rbind are two different things and that rbindlist is faster than do.call(rbind, data). But the suggestion is do.call(rbind.data.frame, file_list). Which is going to be fastest?
Since the original post does not include a reproducible example, here is one that reads data from the Pokémon Stats data that I maintain on GitHub.
First, we download a zip file containing one CSV file for each generation of Pokémon, and unzip it to the ./pokemonData subdirectory of the R working directory.
download.file("https://raw.githubusercontent.com/lgreski/pokemonData/master/PokemonData.zip",
"pokemonData.zip",
method="curl",mode="wb")
unzip("pokemonData.zip",exdir="./pokemonData")
Next, we obtain a list of files in the directory to which we unzipped the CSV files.
thePokemonFiles <- list.files("./pokemonData",
full.names=TRUE)
Finally, we load the data.table package, use lapply() with data.table::fread() to read the files, combine the resulting list of data tables with do.call(), and print the head() and tail() of the resulting data frame with all 8 generations of Pokémon stats.
library(data.table)
data <- do.call(rbind,lapply(thePokemonFiles,fread))
head(data)
tail(data)
...and the output:
> head(data)
ID Name Form Type1 Type2 Total HP Attack Defense Sp. Atk Sp. Def Speed
1: 1 Bulbasaur Grass Poison 318 45 49 49 65 65 45
2: 2 Ivysaur Grass Poison 405 60 62 63 80 80 60
3: 3 Venusaur Grass Poison 525 80 82 83 100 100 80
4: 4 Charmander Fire 309 39 52 43 60 50 65
5: 5 Charmeleon Fire 405 58 64 58 80 65 80
6: 6 Charizard Fire Flying 534 78 84 78 109 85 100
Generation
1: 1
2: 1
3: 1
4: 1
5: 1
6: 1
> tail(data)
ID Name Form Type1 Type2 Total HP Attack Defense Sp. Atk
1: 895 Regidrago Dragon 580 200 100 50 100
2: 896 Glastrier Ice 580 100 145 130 65
3: 897 Spectrier Ghost 580 100 65 60 145
4: 898 Calyrex Psychic Grass 500 100 80 80 80
5: 898 Calyrex Ice Rider Psychic Ice 680 100 165 150 85
6: 898 Calyrex Shadow Rider Psychic Ghost 680 100 85 80 165
Sp. Def Speed Generation
1: 50 80 8
2: 110 30 8
3: 80 130 8
4: 80 80 8
5: 130 50 8
6: 100 150 8
>
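As for the speed question: one hedged alternative, reusing the thePokemonFiles vector from above, is to pass the list of data tables straight to data.table::rbindlist(), which binds them in a single pass instead of copying on every iteration.
library(data.table)
# bind all per-generation files at once; fill = TRUE guards against files
# whose columns differ slightly
data <- rbindlist(lapply(thePokemonFiles, fread), use.names = TRUE, fill = TRUE)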
I want to create a new data.table, or maybe just add some columns to an existing data.table. It is easy to specify multiple new columns, but what happens if I want a third column that calculates a value based on one of the columns I am creating? I think the plyr package can do something like that. Can we perform such iterative (sequential) column creation in data.table?
I want to do something like the following:
dt <- data.table(shop = 1:10, income = 10:19*70)
dt[ , list(hope = income * 1.05, hopemore = income * 1.20, hopemorerealistic = hopemore - 100)]
or maybe
dt[ , `:=`(hope = income*1.05, hopemore = income*1.20, hopemorerealistic = hopemore-100)]
You can also use <- within the call to list, e.g.
DT <- data.table(a=1:5)
DT[, c('b','d') := list(b1 <- a*2, b1*3)]
DT
a b d
1: 1 2 6
2: 2 4 12
3: 3 6 18
4: 4 8 24
5: 5 10 30
Or
DT[, `:=`(hope = hope <- a+1, z = hope-1)]
DT
a b d hope z
1: 1 2 6 2 1
2: 2 4 12 3 2
3: 3 6 18 4 3
4: 4 8 24 5 4
5: 5 10 30 6 5
It is possible by using curly braces and semicolons in j.
There are multiple ways to go about it, here are two examples:
# If you simply want to output:
dt[ ,
{hope=income*1.05;
hopemore=income*1.20;
list(hope=hope, hopemore=hopemore, hopemorerealistic=hopemore-100)}
]
# if you want to save the values
dt[ , c("hope", "hopemore", "hopemorerealistic") :=
{hope=income*1.05;
hopemore=income*1.20;
list(hope, hopemore, hopemore-100)}
]
dt
# shop income hope hopemore hopemorerealistic
# 1: 1 700 735.0 840 740
# 2: 2 770 808.5 924 824
# 3: 3 840 882.0 1008 908
# 4: 4 910 955.5 1092 992
# 5: 5 980 1029.0 1176 1076
# 6: 6 1050 1102.5 1260 1160
# 7: 7 1120 1176.0 1344 1244
# 8: 8 1190 1249.5 1428 1328
# 9: 9 1260 1323.0 1512 1412
# 10: 10 1330 1396.5 1596 1496
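Another option in the same spirit (a sketch using the dt from the question): chain the [ calls so each new column can refer to the one created immediately before it.
dt <- data.table(shop = 1:10, income = 10:19*70)
# each := modifies dt by reference, so the next call can use the new column
dt[, hope := income * 1.05][, hopemore := income * 1.20][, hopemorerealistic := hopemore - 100]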
I am trying to add two columns to a data.table. The original structure is below:
> aTable
word freq
1: thanks for the follow 612
2: the end of the 491
3: the rest of the 462
4: at the end of 409
5: is going to be 359
6: for the first time 355
7: at the same time 346
8: cant wait to see 338
9: thank you for the 334
10: thanks for the rt 321
My code is as follows:
myKeyValfun <- function(line) {
ret1 = paste(head(strsplit(dtable4G$word,split=" ")[[1]],3), collapse=" ")
ret2 = tail(strsplit(line,split=" ")[[1]],1)
return(list(key = ret1, value = ret2))
}
aTable[, c("key","value") := myKeyValfun(word)]
After I execute this, I noticed that the values are not updated correctly. Only the first row has the correct values; the other rows have the same values as the first row.
See below:
> aTable
word freq key value
1: thanks for the follow 612 thanks for the follow
2: the end of the 491 thanks for the follow
3: the rest of the 462 thanks for the follow
4: at the end of 409 thanks for the follow
5: is going to be 359 thanks for the follow
6: for the first time 355 thanks for the follow
7: at the same time 346 thanks for the follow
8: cant wait to see 338 thanks for the follow
9: thank you for the 334 thanks for the follow
10: thanks for the rt 321 thanks for the follow
Any ideas?
Adding the expected result as requested by akrun:
> aTable
word freq key value
1: thanks for the follow 612 thanks for the follow
2: the end of the 491 the end of the
3: the rest of the 462 the rest of the
4: at the end of 409 at the end of
5: is going to be 359 is going to be
6: for the first time 355 for the first time
7: at the same time 346 at the same time
8: cant wait to see 338 cant wait to see
9: thank you for the 334 thank you for the
10: thanks for the rt 321 thanks for the rt
If we need to extract the first three words into 'key' and the last word into 'value', one option is sub:
aTable[, c('key', 'value') := list(sub('(.*)\\s+.*', '\\1', word), sub('.*\\s+', '', word))]
aTable
# word freq key value
# 1: thanks for the follow 612 thanks for the follow
# 2: the end of the 491 the end of the
# 3: the rest of the 462 the rest of the
# 4: at the end of 409 at the end of
# 5: is going to be 359 is going to be
# 6: for the first time 355 for the first time
# 7: at the same time 346 at the same time
# 8: cant wait to see 338 cant wait to see
# 9: thank you for the 334 thank you for the
#10: thanks for the rt 321 thanks for the rt
Or we can use tstrsplit:
aTable[, c('key', 'value') := {
tmp <- tstrsplit(word, ' ')
list(do.call(paste, tmp[1:3]), tmp[[4]])}]
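For completeness, a hedged sketch of why the original function only produced one distinct result: strsplit(...)[[1]] looks at just the first element of the column, so every row gets the first row's key and value. Applying a corrected version of the function row by row (here via a per-row grouping) should also give the expected output:
myKeyValfun <- function(line) {
  words <- strsplit(line, split = " ")[[1]]
  list(key = paste(head(words, 3), collapse = " "),
       value = tail(words, 1))
}
# group by row index so the function sees one 'word' at a time
aTable[, c("key", "value") := myKeyValfun(word), by = seq_len(nrow(aTable))]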
I want to demean a whole data.table object (or just a list of many columns of it) by groups.
Here's my approach so far:
setkey(myDt, groupid)
for (col in colnames(wagesOfFired)){
  myDt[, paste(col, 'demeaned', sep='.') := col - mean(col), with=FALSE]
}
which gives
Error in col - mean(col) : non-numeric argument to binary operator
Here's some sample data. In this simple case there are only a couple of columns, but I typically have so many columns that I want to iterate over a list of them:
y groupid x
1: 3.46000 51557094 97
2: 111.60000 51557133 25
3: 29.36000 51557133 23
4: 96.38000 51557133 9
5: 65.22000 51557193 32
6: 66.05891 51557328 10
7: 9.74000 51557328 180
8: 61.59000 51557328 18
9: 9.99000 51557328 18
10: 89.68000 51557420 447
11: 129.24436 51557429 15
12: 3.46000 51557638 3943
13: 117.36000 51557642 11
14: 9.51000 51557653 83
15: 68.16000 51557653 518
16: 96.38000 51557653 14
17: 9.53000 51557678 18
18: 7.96000 51557801 266
19: 51.88000 51557801 49
20: 10.70000 51558040 1034
The problem is that col is a string, so col-mean(col) cannot be computed.
myNames <- names(myDt)
myDt[, paste(myNames, "demeaned", sep = ".") :=
       lapply(.SD, function(x) x - mean(x)),
     by = groupid, .SDcols = myNames]
Comments:
You don't need to set a key.
It's in one operation because using [ repeatedly can be slow.
You can change myNames to some subset of the column names.
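For instance, a small variation (a sketch, assuming the sample columns y, groupid and x shown above) that demeans everything except the grouping column itself:
library(data.table)
myNames <- setdiff(names(myDt), "groupid")  # skip the grouping column
myDt[, paste(myNames, "demeaned", sep = ".") :=
       lapply(.SD, function(x) x - mean(x)),
     by = groupid, .SDcols = myNames]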