I have the following data:
library(data.table)
sales <- data.table(Customer = c(192,964,929,345,898,477,705,804,188,231,780,611,420,816,171,212,504,526,471,979,524,410,557,152,417,359,435,820,305,268,763,194,757,475,351,933,805,687,813,880,798,327,602,710,785,840,446,891,165,662),
Producttype = c(1,2,3,2,3,3,2,1,3,3,1,1,2,2,1,3,1,3,3,1,1,1,1,3,3,3,3,2,1,1,3,3,3,3,1,1,3,3,3,2,3,2,3,3,3,2,1,2,3,1),
Price = c(469,721,856,956,554,188,429,502,507,669,427,582,574,992,418,835,652,983,149,917,370,617,876,337,663,252,599,949,915,556,313,842,892,724,415,307,900,114,439,456,541,261,881,757,199,308,958,374,409,738),
Quarter = c(2,3,3,4,4,1,4,4,3,3,1,1,1,1,1,1,4,1,2,1,3,1,2,3,3,4,4,1,1,4,1,1,3,2,1,3,3,2,2,2,1,4,3,3,1,1,1,3,1,1))
How can I remove (let's say) the row in which Customer = 891?
And then I have another question:
If I want to manipulate the data I use data[row, column]. But when I want only the rows in which Quarter equals (for example) 4, I use data[Quarter == 4, ]. Why is it not data[, Quarter == 4], since Quarter is a column and not a row?
I did not find an answer on the internet that really explains the why.
Thank you.
You created your data with the data.table() function, so you can write:
sales[Customer != 891,]
data[Quarter == 4, ] returns all columns for the rows where Quarter equals 4 (note the double ==; a single = is not a comparison). The condition goes before the comma because it decides which rows to keep; leaving the part after the comma empty means "all columns".
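To see why the condition belongs in the row position, it can help to look at what the condition evaluates to. Below is a minimal base R sketch using a small made-up data frame, toy (hypothetical, not the sales data above):

```r
# Hypothetical toy data, only for illustration
toy <- data.frame(Customer = c(101, 102, 103, 104),
                  Quarter  = c(4, 1, 4, 2))

# The comparison yields one TRUE/FALSE per row of the table
keep <- toy$Quarter == 4
keep

# A logical vector in the first slot of [ , ] keeps the rows marked TRUE;
# the empty second slot means "all columns"
q4 <- toy[keep, ]

# Dropping a single row works the same way, with the test negated
without_102 <- toy[toy$Customer != 102, ]
```

The condition mentions a column, but its result has one entry per row, which is why it acts as a row selector.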
When you use indexing, i.e. data[row, column], you are telling R which rows and which columns to return.
Row: sales[sales$Customer %in% c(192,964),] translates to "search the column Customer in the data frame (or table) for any rows whose value is 192 or 964 and isolate them". Note that data.table also allows sales[Customer %in% c(192, 964),], but data frames can't (use sales[sales$Customer %in% c(192,964),]).
Customer Producttype Price Quarter
1: 192 1 469 2
2: 964 2 721 3
Columns: sales[, "Customer"] translates to "search the data frame (or table) for a column named Customer and isolate all its rows".
Customer
1: 192
2: 964
3: 929
4: 345
5: 898
...
Note this returns a data table with one column. If you use sales[,Customer] (data table) or sales$Customer (data frame), it will return a vector:
# [1] 192 964 929 345 898 477 705 804 188 231 780 611 420 816 171 212 504 526 471 979 524
# [22] 410 557 152 417 359 435 820 305 268 763 194 757 475 351 933 805 687 813 880 798 327
# [43] 602 710 785 840 446 891 165 662
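One caveat worth checking on your own data: a plain data frame behaves differently from a data.table here. The sketch below uses a made-up data frame, toy (hypothetical), to show how base R drops a single selected column to a vector unless told otherwise:

```r
# Hypothetical toy data, only for illustration
toy <- data.frame(Customer = c(101, 102, 103),
                  Price    = c(10, 20, 30))

# On a plain data frame, selecting one column by name drops to a vector
as_vector <- toy[, "Customer"]

# drop = FALSE keeps the result as a one-column data frame
as_frame <- toy[, "Customer", drop = FALSE]

# List-style indexing (no comma) also keeps the data frame
also_frame <- toy["Customer"]
```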
You can of course combine the two: sales[sales$Quarter %in% 1:2, c("Customer", "Producttype")] isolates the values of Customer and Producttype for the rows in quarters 1 and 2:
Customer Producttype
1: 192 1
2: 477 3
3: 780 1
4: 611 1
5: 420 2
...
I am trying to generate a vector breaks_x from another vector break_init. If the difference between two successive elements of break_init is less than 10, the element ending with two zeros should be removed.
My code always removes break_init[i], even if it does not end with two zeros.
Can anyone help, please?
break_init <- c(100,195,200,238,300,326,400,481,500,537,600,607,697,700,800,875,900,908,957)
breaks_x <- vector()
for(i in 1:(length(break_init) - 1))
{
if (break_init[i+1] - break_init[i] >= 10) {
breaks_x[i] <- break_init[i]
} else {
if (grepl("[00]$", as.character(break_init[i])) == TRUE){
breaks_x[i] <- NA
} else if (grepl("[00]$", as.character(break_init[i])) == FALSE) {
breaks_x[i+1] <- NA
} else {
breaks_x[i] <- break_init[i]
}
}
}
[1] 0 100 NA 200 238 300 326 400 481 500 537 NA 607 NA 700 800 875 NA 908 957 #result breaks_x
[1] 0 100 195 NA 238 300 326 400 481 500 537 NA 607 697 NA 800 875 NA 908 957 #what I want my result to be
r2evans has the right idea. Just a little modification to check both the forward and the backwards difference:
bln10 <- diff(break_init) < 10
breaks_x <- replace(break_init, (c(FALSE, bln10) | c(bln10, FALSE)) & break_init %% 100 == 0, NA)
breaks_x
# [1] 100 195 NA 238 300 326 400 481 500 537 NA 607 697 NA 800 875 NA 908 957
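To make the one-liner easier to follow, here is the same logic split into named steps. The intermediate names (close_to_next, close_to_prev, is_hundred, drop) are only illustrative:

```r
break_init <- c(100,195,200,238,300,326,400,481,500,537,600,607,697,700,800,875,900,908,957)

gap_small <- diff(break_init) < 10        # one entry per adjacent pair

# Align the pairwise result with the elements themselves
close_to_next <- c(gap_small, FALSE)      # element i is within 10 of element i + 1
close_to_prev <- c(FALSE, gap_small)      # element i is within 10 of element i - 1

is_hundred <- break_init %% 100 == 0      # ends in two zeros

# Only multiples of 100 that sit next to a close neighbour are dropped
drop <- (close_to_next | close_to_prev) & is_hundred
break_init[drop]

breaks_x <- replace(break_init, drop, NA)
```

Checking both directions matters because the close neighbour can come before the round number (195, 200) or after it (600, 607).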
I have (after a long script) a vector that looks like
258 814 815 816 817 818 819 862 863 864 865 866 867 868
869 870 871 872 1377 1378 1379 1393 1394 1395 1396 1397 1398
1399 1400 ........
This is quite difficult to get an overview of. So I would like some way to turn it into
258
814-819
862-872
1377-1379
1393-1400
and so on....
I have thought about some sort of for loop that adds value to string if x[i+1]!=x[i]+1, but this can take some time if the dataset is large...
For input
x <- c(258, 814:819, 862:872, 1377:1379, 1393:1400)
The output should be
"258\n814-819\n862-872\n1377-1379\n1393-1400"
Adding on to Josh's answer this should work:
rr <- rle(x - seq_along(x))
rr$values <- seq_along(rr$values)
s <- split(x, inverse.rle(rr))
paste(sapply(s, function(g) {
  # a run of length 1 prints as itself, longer runs as "first-last"
  if (length(g) > 1) paste(g[1], g[length(g)], sep = "-") else as.character(g)
}), collapse = "\n")
[1] "258\n814-819\n862-872\n1377-1379\n1393-1400"
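The key step is x - seq_along(x): within a run of consecutive integers, both the value and its position grow by one, so their difference stays constant. A short sketch on a shortened version of the input shows the intermediate values:

```r
x <- c(258, 814:819, 862:872)

# Value minus position is constant inside each consecutive run
offset <- x - seq_along(x)

# rle() then finds where that constant changes, i.e. where a new run starts
rr <- rle(offset)
rr$lengths
rr$values

# One group id per element, equivalent to relabelling rr$values and
# calling inverse.rle() as in the answer above
group_id <- rep(seq_along(rr$lengths), rr$lengths)
```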
In addition to the options above and at the linked question, there is also seqToHumanReadable from the "R.utils" package:
library(R.utils)
seqToHumanReadable(x)
# [1] "258, 814-819, 862-872, 1377-1379, 1393-1400"
To get your exact desired output, use gsub:
gsub(",\\s+", "\n", seqToHumanReadable(x))
# [1] "258\n814-819\n862-872\n1377-1379\n1393-1400"
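If installing a package is not an option, a base R variant is possible as well. This is a sketch rather than a drop-in replacement for seqToHumanReadable, and it assumes x is sorted and contains no duplicates:

```r
x <- c(258, 814:819, 862:872, 1377:1379, 1393:1400)

# A new run starts at the first element and wherever the step is not exactly 1
run_id <- cumsum(c(TRUE, diff(x) != 1))

# Label each run as "first-last", or just the value for runs of length 1
labels <- tapply(x, run_id, function(g) {
  if (length(g) > 1) paste(g[1], g[length(g)], sep = "-") else as.character(g)
})

out <- paste(labels, collapse = "\n")
cat(out, "\n")
```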
Let's say I'd like to calculate the magnitude of the range over a few columns, on a row-by-row basis.
set.seed(1)
dat <- data.frame(x=sample(1:1000,1000),
y=sample(1:1000,1000),
z=sample(1:1000,1000))
Using data.frame(), I would do something like this:
dat$diff_range <- apply(dat,1,function(x) diff(range(x)))
To put it more simply, I'm looking for this operation, over each row:
diff(range(dat[1,])) # for i in 1:nrow(dat)
If I were doing this for the entire table, it would be something like:
setDT(dat)[,diff_range := apply(dat,1,function(x) diff(range(x)))]
But how would I do it for only named (or numbered) rows?
pmax and pmin find the min and max across columns in a vectorized way, which is much better than splitting and working with each row separately. It's also pretty concise:
dat[, r := do.call(pmax,.SD) - do.call(pmin,.SD)]
x y z r
1: 266 531 872 606
2: 372 685 967 595
3: 572 383 866 483
4: 906 953 437 516
5: 201 118 192 83
---
996: 768 945 292 653
997: 61 231 965 904
998: 771 145 18 753
999: 841 148 839 693
1000: 857 252 218 639
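To restrict the calculation to chosen rows, the same pmax/pmin idea can be applied to a subset. The sketch below uses base R and a small made-up data frame, dat_small (hypothetical), so the result can be checked by hand. With a data.table, putting the row selector in i, for example dat[c(1:4, 15:18), ...], should restrict .SD in the same way, but that is an assumption about your setup rather than something shown here.

```r
# Hypothetical toy data, only for illustration
dat_small <- data.frame(x = c(5, 1, 9, 4),
                        y = c(2, 8, 3, 4),
                        z = c(7, 6, 1, 10))

rows <- c(1, 3)              # the rows of interest
cols <- c("x", "y", "z")     # the columns to take the range over

sub <- dat_small[rows, cols]

# pmax/pmin receive each column as a separate argument and work element-wise,
# which gives the row-wise max and min without looping over rows
row_range <- do.call(pmax, sub) - do.call(pmin, sub)
row_range
```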
How about this:
D[,list(I=.I,x,y,z)][,diff(range(x,y,z)),by=I][c(1:4,15:18)]
# I V1
#1: 1 971
#2: 2 877
#3: 3 988
#4: 4 241
#5: 15 622
#6: 16 684
#7: 17 971
#8: 18 835
#actually this will be faster
D[c(1:4,15:18),list(I=.I,x,y,z)][,diff(range(x,y,z)),by=I]
Use .I to create an index to pass to the by= parameter; then you can run the function on each row. The second call pre-filters by any list of row numbers, or you can add a key and filter on that if your real table looks different.
You can do it by subsetting before/during the function. If you only want every second row for example
dat_Diffs <- apply(dat[seq(2,1000,by=2),],1,function(x) diff(range(x)))
Or for rownames 1:10 (since their names weren't specified they are just numbers counting up)
dat_Diffs <- apply(dat[rownames(dat) %in% 1:10,],1,function(x) diff(range(x)))
But why not just calculate per row then subset later?
So I've got a CSV that I'm reading into an R data frame; it looks like this:
clientx,clienty,screenx,screeny
481,855,481,847
481,784,481,847
481,784,481,847
879,292,879,355
First line is of course the header. So we have 4 columns with numeric data in it, ranging from 1 to 4 digits. There are no negative numbers in the set except -1 which marks a missing value.
I want to remove every row that contains a -1 in any of the 4 columns.
Thanks in advance for the help
The most efficient way is to use the na.strings argument of read.csv() to read all -1 values as NA, and then to drop incomplete cases.
Step 1: set na.strings="-1" in read.csv():
x <- read.csv(text="
clientx,clienty,screenx,screeny
481,855,481,847
481,784,481,847
481,784,481,847
-1,292,879,355", header=TRUE, na.strings="-1")
x
clientx clienty screenx screeny
1 481 855 481 847
2 481 784 481 847
3 481 784 481 847
4 NA 292 879 355
Step 2: Now use complete.cases or na.omit:
x[complete.cases(x), ]
clientx clienty screenx screeny
1 481 855 481 847
2 481 784 481 847
3 481 784 481 847
na.omit(x)
clientx clienty screenx screeny
1 481 855 481 847
2 481 784 481 847
3 481 784 481 847
The direct way:
df <- df[!apply(df, 1, function(x) {any(x == -1)}),]
UPDATE: this approach will fail if the data.frame contains character columns. apply() implicitly converts the data.frame to a matrix, which can hold only one type, and character takes priority over numeric, so the whole data.frame is converted to a character matrix.
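One way to sidestep that coercion is to test only the numeric columns and avoid apply() altogether. This is a sketch on a made-up data frame, df_mixed (hypothetical), with one character column added:

```r
# Hypothetical toy data with a character column, only for illustration
df_mixed <- data.frame(clientx = c(481, -1, 879),
                       clienty = c(855, 784, -1),
                       label   = c("a", "b", "c"))

# Identify the numeric columns so the character column is never compared
num_cols <- vapply(df_mixed, is.numeric, logical(1))

# Comparing a data frame with -1 gives a logical matrix; rowSums counts hits per row
has_missing <- rowSums(df_mixed[num_cols] == -1) > 0

cleaned <- df_mixed[!has_missing, ]
cleaned
```

Note that this assumes the numeric columns contain no NA values already; if they do, rowSums would need na.rm = TRUE.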
Or replace -1 with NA and then use na.omit:
df[df==-1] <- NA
df <- na.omit(df)
These should work, I didn't check. Please always try to provide a reproducible example to illustrate your question.