R: Group number string to from-to form

I have (after a long script) a value/vector that looks like
258 814 815 816 817 818 819 862 863 864 865 866 867 868
869 870 871 872 1377 1378 1379 1393 1394 1395 1396 1397 1398
1399 1400 ........
This is quite difficult to get control over, so I would like some way to turn it into
258
814-819
862-872
1377-1379
1393-1400
and so on....
I have thought about some sort of for loop that adds the value to a string if x[i+1] != x[i] + 1, but this can take some time if the dataset is large...
For input
x <- c(258, 814:819, 862:872, 1377:1379, 1393:1400)
The output should be
"258\n814-819\n862-872\n1377-1379\n1393-1400"

Adding on to Josh's answer, this should work:
rr <- rle(x - seq_along(x))          # consecutive values share the same offset from their position
rr$values <- seq_along(rr$values)    # give each run a unique id
s <- split(x, inverse.rle(rr))       # split x into its consecutive runs
paste(lapply(s, FUN = function(x) if (length(x) > 1) paste(x[1], x[length(x)], sep = "-") else x), collapse = "\n")
[1] "258\n814-819\n862-872\n1377-1379\n1393-1400"
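As a side note (not part of the original answer, just a sketch of the same idea): the x - seq_along(x) step works because consecutive values stay constant once you subtract their positions, so rle() can see the runs. Labelling the runs with cumsum() gives an equivalent, perhaps more transparent version:
grp <- cumsum(c(TRUE, diff(x) != 1))   # start a new group wherever the gap is not exactly 1
paste(sapply(split(x, grp),
             function(v) if (length(v) > 1) paste(v[1], v[length(v)], sep = "-") else v),
      collapse = "\n")
[1] "258\n814-819\n862-872\n1377-1379\n1393-1400"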

In addition to the options above and at the linked question, there is also seqToHumanReadable from the "R.utils" package:
library(R.utils)
seqToHumanReadable(x)
# [1] "258, 814-819, 862-872, 1377-1379, 1393-1400"
To get your exact desired output, use gsub:
gsub(",\\s+", "\n", seqToHumanReadable(x))
# [1] "258\n814-819\n862-872\n1377-1379\n1393-1400"

Report the mean number of characters in Corpus document

So I have a corpus set up reading a bunch of text files with paragraphs in them.
library('tm')
my.text.location <- "C:/Users//.../*/"
apapers <- VCorpus(DirSource(my.text.location))
Now I need to find the mean number of characters per text. Running
mean(nchar(apapers), na.rm = TRUE)
results in a very weird output: a value larger than the actual number of characters.
Any other way to get the mean?
You didn't supply a reproducible example, but rowMeans(sapply(apapers, nchar)) will return the mean number of characters over all documents; "content" is the entry you need.
A longer version is running sapply over the corpus, counting the number of characters per document. Transpose this result and turn it into a data.frame. The data.frame will contain two columns, content and meta; content is the one you need. Taking the mean of the content column gives you the average number of characters in a document. The advantage of this is that you keep the table in case you need to report the numbers.
# your code
my_count <- data.frame(t(sapply(apapers, nchar)))
mean(my_count$content)
Reproducible example using the crude dataset:
library(tm)
data("crude")
crude <- as.VCorpus(crude)
# in one statement
rowMeans(sapply(crude, nchar))
content meta
1220.30 453.15
# longer version keeping intermediate results.
my_count <- data.frame(t(sapply(crude, nchar)))
mean(my_count$content)
[1] 1220.3
my_count
content meta
127 527 440
144 2634 458
191 330 444
194 394 441
211 552 441
236 2774 455
237 2747 477
242 930 453
246 2115 440
248 2066 466
273 2241 458
349 593 492
352 621 468
353 591 445
368 629 440
489 876 445
502 1166 446
543 463 447
704 1797 456
708 360 451
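If you only care about the text itself and not the metadata, a minimal sketch is to count the characters of the document content directly (this assumes the documents are plain-text documents, so content() returns their text):
content_chars <- sapply(crude, function(d) sum(nchar(content(d))))  # characters in the text body only
mean(content_chars)  # should agree with the "content" column above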

Convert time values to numeric while keeping time characteristics

I have a data set which contains interval times of different events occurring. What I want to do is convert the data into a numeric vector, so it's easier to manipulate and run summaries/make graphs etc., while keeping its time characteristics. Here is a snippet of my data:
data <- c("03:31", "12:17", "16:29", "09:52", "04:01", "09:00", "06:29",
          "04:17", "04:42")
class(data)
[1] "character"
The obvious answer is:
as.numeric(data)
But I get this error:
Warning message:
NAs introduced by coercion
I thought of maybe taking the ':' out, but then it loses its time characteristics. By that I mean that if I sum two values, say 347 and 543 (i.e. 03:47 and 05:43), it would give me 890 as opposed to 930. Here is the code that I would use to take the colon out, which works fine for its purpose:
Nocolon <- gsub("[:]", "", data, perl = TRUE)
"0331" "1217" "1629" "0952" "0401" "0900" "0629" "0417" "0442"
So essentially, what I want is for my time values to be in a form which is easy to manipulate and analyse. My idea is for it to be a numeric vector, but that is from my minimal understanding of R. My actual code has thousands of time values, and I want to create a plot that will allow me to view and determine whether the values follow a statistical distribution.
Thanks in advance!
Here are some approaches. All convert to minutes. For example, the first component is "03:31" which is 3 * 60 + 31 = 211 minutes. (1) to (5) do not use any packages.
1) %*% It works by reading data into a 2 column data frame with hours and minutes. That is converted to a matrix so that it can be matrix multiplied by c(60, 1). Finally, unravel it with c.
c(as.matrix(read.table(text = data, sep = ":")) %*% c(60, 1))
[1] 211 737 989 592 241 540 389 257 282
2) with This variation is even shorter. It creates the same data frame and then simply multiplies the first column (V1) by 60 and adds it to the second column (V2).
with(read.table(text = data, sep = ":"), 60*V1+V2)
[1] 211 737 989 592 241 540 389 257 282
3) complex This converts each component to a complex number and then performs the required arithmetic on the real and imaginary parts:
data_c <- as.complex(sub(":(\\d+)", "+\\1i", data))
60 * Re(data_c) + Im(data_c)
## [1] 211 737 989 592 241 540 389 257 282
3a) This variation of (3) also works and avoids regular expressions:
data_c <- as.complex(paste0(chartr(":", "+", data), "i"))
60 * Re(data_c) + Im(data_c)
## [1] 211 737 989 592 241 540 389 257 282
4) eval This converts each component into an arithmetic expression which evaluates to the number of minutes and then performs the evaluation. Using eval is not really recommended when you can avoid it, so this one is less desirable:
sapply(parse(text = sub("(\\d+):", "60*\\1+", data)), eval)
## [1] 211 737 989 592 241 540 389 257 282
5) POSIXlt We can convert to "POSIXlt" class and then use the hour and min components:
with(unclass(as.POSIXlt(data, format = "%H:%M")), 60 * hour + min)
## [1] 211 737 989 592 241 540 389 257 282
6) chron Using the chron package we can paste on the seconds, convert to "times" class and then convert to minutes:
library(chron)
24 * 60 * as.numeric(times(paste0(data, ":00")))
## [1] 211 737 989 592 241 540 389 257 282
7) lubridate Using the lubridate package we can convert it using hm and then to numeric giving seconds and finally dividing by 60 to give minutes:
as.numeric(hm(data)) / 60
## [1] 211 737 989 592 241 540 389 257 282
Use the as.difftime function designed for this:
as.difftime(data, format="%H:%M", units="mins")
#Time differences in mins
#[1] 211 737 989 592 241 540 389 257 282
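Whichever conversion you pick, you end up (after as.numeric() if needed) with a plain numeric vector of minutes, so the usual summaries and plots apply directly; a minimal sketch using approach (2):
mins <- with(read.table(text = data, sep = ":"), 60 * V1 + V2)
summary(mins)                                               # numeric summary of the interval times
hist(mins, breaks = 20, main = "Interval times (minutes)")  # rough look at the distribution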

Row-wise operation on data.table

Let's say I'd like to calculate the magnitude of the range over a few columns, on a row-by-row basis.
set.seed(1)
dat <- data.frame(x = sample(1:1000, 1000),
                  y = sample(1:1000, 1000),
                  z = sample(1:1000, 1000))
Using data.frame(), I would do something like this:
dat$diff_range <- apply(dat,1,function(x) diff(range(x)))
To put it more simply, I'm looking for this operation, over each row:
diff(range(dat[i, ]))  # for each i in 1:nrow(dat)
If I were doing this for the entire table, it would be something like:
setDT(dat)[,diff_range := apply(dat,1,function(x) diff(range(x)))]
But how would I do it for only named (or numbered) rows?
pmax and pmin find the min and max across columns in a vectorized way, which is much better than splitting and working with each row separately. It's also pretty concise:
dat[, r := do.call(pmax,.SD) - do.call(pmin,.SD)]
x y z r
1: 266 531 872 606
2: 372 685 967 595
3: 572 383 866 483
4: 906 953 437 516
5: 201 118 192 83
---
996: 768 945 292 653
997: 61 231 965 904
998: 771 145 18 753
999: 841 148 839 693
1000: 857 252 218 639
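To address the named/numbered rows part of the question directly, the same vectorized idea can be combined with row subsetting in i; rows not selected are simply left as NA. This is a sketch (the row numbers are illustrative, and .SDcols names the value columns explicitly so an existing r column is not picked up):
dat[c(1, 5, 10), r := do.call(pmax, .SD) - do.call(pmin, .SD), .SDcols = c("x", "y", "z")]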
How about this:
dat[, list(I = .I, x, y, z)][, diff(range(x, y, z)), by = I][c(1:4, 15:18)]
# I V1
#1: 1 971
#2: 2 877
#3: 3 988
#4: 4 241
#5: 15 622
#6: 16 684
#7: 17 971
#8: 18 835
# actually this will be faster
dat[c(1:4, 15:18), list(I = .I, x, y, z)][, diff(range(x, y, z)), by = I]
Use .I to give you an index to use with the by= parameter; then you can run the function on each row. The second call pre-filters by any list of row numbers, or you can add a key and filter on that if your real table looks different.
You can do it by subsetting before/during the function. If you only want every second row for example
dat_Diffs <- apply(dat[seq(2,1000,by=2),],1,function(x) diff(range(x)))
Or for rownames 1:10 (since their names weren't specified they are just numbers counting up)
dat_Diffs <- apply(dat[rownames(dat) %in% 1:10,],1,function(x) diff(range(x)))
But why not just calculate per row then subset later?
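For instance, a quick sketch of that calculate-then-subset idea:
dat$diff_range <- apply(dat, 1, function(x) diff(range(x)))  # compute once for every row
dat[c(1:4, 15:18), ]                                         # then inspect only the rows you need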

How to de-merge column names and data in the 1st row when importing a .csv file?

Code:
data <- read.csv("./data.csv",header=T)
data
Output:
X224786 X578 X871 X9719
1 230034 546 969 10262
2 236562 599 845 10120
Expected Output:
A B C D
224786 578 871 9719
230034 546 969 10262
236562 599 845 10120
Obviously, your *.csv file has no header line. So, try:
data <- read.csv("./data.csv", header=F)
names(data) <- c("A","B","C","D")
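The two steps can also be combined into a single call by passing col.names directly (a sketch, assuming the file really has four unnamed columns):
data <- read.csv("./data.csv", header = FALSE, col.names = c("A", "B", "C", "D"))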
Try read.table instead of read.csv. read.csv requires commas between each field.

R: How to remove duplicates from a list of lists

I have a list of lists that contain the following 2 variables:
> dist_sub[[1]]$zip
[1] 901 902 906 907 908 909 910 911 912 913 914 915 916 917 918 919 920 921 922 923 924 925 926 927 928
[26] 929 930 931 933 934 935 936 937 938 939 940 955 961 962 963 965 966 968 969 970 975 981
> dist_sub[[1]]$hu
[1] 4990 NA 168 13224 NA 3805 NA 6096 3884 4065 NA 16538 NA 12348 10850 NA
[17] 9322 17728 NA 13969 24971 5413 47317 7893 NA NA NA NA NA 140 NA 4
[33] NA NA NA NA NA 13394 8939 NA 3848 7894 2228 17775 NA NA NA
> dist_sub[[2]]$zip
[1] 921 934 952 956 957 958 959 960 961 962 965 966 968 969 970 971
> dist_sub[[2]]$hu
[1] 17728 140 4169 32550 18275 NA 22445 0 13394 8939 3848 7894 2228 17775 NA 12895
Is there a way to remove duplicates such that, if a zipcode appears in one list, it is removed from the other lists according to specific criteria?
Example: zipcode 00921 is present in the two lists above. I'd like to keep it only on the list with the lowest sum of hu (housing units). In this case I would like to keep zipcode 00921 in the 2nd list only, since the sum of hu is 162,280 in list 2 versus 256,803 in list 1.
Any help is very much appreciated.
Here is a simulated dataset for your problem so that others can use it too.
dist_sub <- list(list("zip" = 1:10,
                      "hu"  = rnorm(10)),
                 list("zip" = 8:12,
                      "hu"  = rnorm(5)),
                 list("zip" = c(1, 3, 11, 7),
                      "hu"  = rnorm(4)))
Here's a solution that I was able to come up with. I realized that loops were really the cleaner way to do this:
do.this <- function(x) {
  for (k in 1:(length(x) - 1)) {
    for (l in (k + 1):length(x)) {
      # positions in set k whose zip also appears in the later set l
      to.remove <- which(x[[k]][["zip"]] %in% x[[l]][["zip"]])
      if (length(to.remove) > 0) {  # guard: v[-integer(0)] would empty the vector
        x[[k]][["zip"]] <- x[[k]][["zip"]][-to.remove]
        x[[k]][["hu"]]  <- x[[k]][["hu"]][-to.remove]
      }
    }
  }
  return(x)
}
The idea is really simple: from each set of zips we keep removing the elements that are repeated in any set after it. We only need to go up to the penultimate set, because the last set keeps everything and every earlier set has already had its shared zips removed.
The criterion you have, i.e. keeping each zip in the set with the lowest sum of hu, can easily be implemented using the function above. What you need to do is reorder the list dist_sub by sum of hu, like so:
sum_hu <- sapply(dist_sub, function (k) sum(k[["hu"]], na.rm=TRUE))
dist_sub <- dist_sub[order(sum_hu, decreasing=TRUE)]
Now you have dist_sub sorted by sum_hu, which means that for each set, the sets that come before it have a larger sum_hu. Therefore, if the sets at positions i and j (i < j) have a value a in common, then a should be removed from the i-th set, and that is exactly what the function above does. Do you think that makes sense?
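Putting the two pieces together, a usage sketch with the simulated data above would be:
sum_hu <- sapply(dist_sub, function(k) sum(k[["hu"]], na.rm = TRUE))
dist_sub_clean <- do.this(dist_sub[order(sum_hu, decreasing = TRUE)])  # each zip survives only in its lowest-sum set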
PS: I've called the function do.this because I usually like writing generic solutions while this was a very specific question, albeit, an interesting one.
