Remove row with specific value - r

I have the following data:
library(data.table)
sales <- data.table(Customer = c(192,964,929,345,898,477,705,804,188,231,780,611,420,816,171,212,504,526,471,979,524,410,557,152,417,359,435,820,305,268,763,194,757,475,351,933,805,687,813,880,798,327,602,710,785,840,446,891,165,662),
Producttype = c(1,2,3,2,3,3,2,1,3,3,1,1,2,2,1,3,1,3,3,1,1,1,1,3,3,3,3,2,1,1,3,3,3,3,1,1,3,3,3,2,3,2,3,3,3,2,1,2,3,1),
Price = c(469,721,856,956,554,188,429,502,507,669,427,582,574,992,418,835,652,983,149,917,370,617,876,337,663,252,599,949,915,556,313,842,892,724,415,307,900,114,439,456,541,261,881,757,199,308,958,374,409,738),
Quarter = c(2,3,3,4,4,1,4,4,3,3,1,1,1,1,1,1,4,1,2,1,3,1,2,3,3,4,4,1,1,4,1,1,3,2,1,3,3,2,2,2,1,4,3,3,1,1,1,3,1,1))
How can I remove (let's say) the row in which Customer = 891?
And then I have another question:
If I want to manipulate the data, I use data[row, column]. But when I want only the rows in which Quarter equals (for example) 4, I use data[Quarter == 4, ]. Why is it not data[, Quarter == 4], since Quarter is a column and not a row?
I did not find an answer on the internet that really explains why.
Thank you.

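You created your data with the data.table() function, so you can write:
sales[Customer != 891, ]
In data[Quarter == 4, ] (note the double ==; a single = is not a comparison), the expression before the comma is a condition that is evaluated once per row, so it selects the rows where Quarter is equal to 4. Leaving the part after the comma empty returns all columns for those rows. The Quarter column is only used to compute the condition; it is not the thing being selected.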
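To see why the condition goes in the row position, it can help to look at what the condition evaluates to. The following is a minimal sketch in base R, using a small made-up subset of the sales data as a plain data frame; the names keep, result and q4 are only for illustration.

```r
# Small illustrative subset of the sales data (plain data frame, base R only)
sales <- data.frame(Customer = c(192, 964, 891, 345),
                    Quarter  = c(2, 3, 3, 4))

# The condition yields one TRUE/FALSE per row, so it belongs in the row slot
keep <- sales$Customer != 891
print(keep)

# Keep the rows where the condition is TRUE; an empty column slot means all columns
result <- sales[keep, ]
print(result)

# Same idea for Quarter: the column is only used to build the row selector
q4 <- sales[sales$Quarter == 4, ]
print(q4)
```

The logical vector has as many entries as there are rows, which is why it cannot sensibly go in the column position.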

When you use indexing, ie, data[row, column] you are telling R to look for either a specific row or column index.
Row: sales[sales$Customer %in% c(192,964),] translates to "search the column Customer in the data frame (or table) for any rows whose value is 192 or 964 and isolate them". Note that data.table allows the shorter sales[Customer %in% c(192, 964),], but data frames can't (use sales[sales$Customer %in% c(192,964),]).
Customer Producttype Price Quarter
1: 192 1 469 2
2: 964 2 721 3
Columns: sales[, "Customer"] translates to "search the data frame (or table) for the column named "Customer" and isolate all its rows".
Customer
1: 192
2: 964
3: 929
4: 345
5: 898
...
Note this returns a data table with one column. If you use sales[,Customer] (data table) or sales$Customer (data frame), it will return a vector:
# [1] 192 964 929 345 898 477 705 804 188 231 780 611 420 816 171 212 504 526 471 979 524
# [22] 410 557 152 417 359 435 820 305 268 763 194 757 475 351 933 805 687 813 880 798 327
# [43] 602 710 785 840 446 891 165 662
You can of course combine the two: sales[sales$Quarter %in% 1:2, c("Customer", "Producttype")] isolates the values of Customer and Producttype for the rows in quarters 1 and 2:
Customer Producttype
1: 192 1
2: 477 3
3: 780 1
4: 611 1
5: 420 2
...

Related

Capture the column index in R or excel for a series of data for a condition

I would like to capture the index value for any value less than 500 for a series of data.
Below is how my data looks like
Category,Price1,Price2,Price3,Price4,Price5,Price6
Product1,967,855,929,811,501,387
Product2,526,809,723,304,315,671
Product3,412,133,369,930,400,337
Product4,709,241,625,822,967,952
Product5,395,506,110,280,829,817
Product6,803,618,794,214,605,788
For example, in the first row, Price6 is the first element in the series Price1 to Price6 whose value is less than 500, hence "First" is 6 in the output.
Similarly, in the second row, Price4 and then Price5 are less than 500, hence First and Second are 4 and 5 respectively.
When nothing is captured, I want to place a "-" instead.
Below is the output I am looking for.
Category,Price1,Price2,Price3,Price4,Price5,Price6,First,Second,Third,Fourth,Fifth,Sixth
Product1,967,855,929,811,501,387,6,-,-,-,-,-
Product2,526,809,723,304,315,671,4,5,-,-,-,-
Product3,412,133,369,930,400,337,1,2,3,5,6,-
Product4,709,241,625,822,967,952,2,-,-,-,-,-
Product5,395,506,110,280,829,817,1,3,4,-,-,-
Product6,803,618,794,214,605,788,4,-,-,-,-,-
Not sure how to do the same in R or excel.
Any leads would be highly appreciated.
Thanks,
Using data.table
dt[, when := melt(dt, id.vars = "Category")[, toString(which(value < 500)), Category][, V1]]
cbind(dt, dt[, tstrsplit(when, ", ", fill = "-")])
Gives
Category Price1 Price2 Price3 Price4 Price5 Price6 when V1 V2 V3 V4 V5
1: Product1 967 855 929 811 501 387 6 6 - - - -
2: Product2 526 809 723 304 315 671 4, 5 4 5 - - -
3: Product3 412 133 369 930 400 337 1, 2, 3, 5, 6 1 2 3 5 6
4: Product4 709 241 625 822 967 952 2 2 - - - -
5: Product5 395 506 110 280 829 817 1, 3, 4 1 3 4 - -
6: Product6 803 618 794 214 605 788 4 4 - - - -
Now you just need to rename the columns V1-V5 and drop the column when.
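That last step could look like the sketch below. It assumes the data.table package is installed and rebuilds dt from the question's data so that it runs on its own; the target names First to Fifth are taken from the question, and res is just an illustrative name. Only five new columns appear because no row has six values below 500.

```r
library(data.table)

dt <- fread(text = "Category,Price1,Price2,Price3,Price4,Price5,Price6
Product1,967,855,929,811,501,387
Product2,526,809,723,304,315,671
Product3,412,133,369,930,400,337
Product4,709,241,625,822,967,952
Product5,395,506,110,280,829,817
Product6,803,618,794,214,605,788")

# Positions of the values below 500, per Category, as one comma-separated string
dt[, when := melt(dt, id.vars = "Category")[, toString(which(value < 500)), Category][, V1]]

# Split the string into separate columns, padding short rows with "-"
res <- cbind(dt, dt[, tstrsplit(when, ", ", fill = "-")])

# Rename V1..V5 and drop the helper column
setnames(res, paste0("V", 1:5), c("First", "Second", "Third", "Fourth", "Fifth"))
res[, when := NULL]
print(res)
```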
Data:
dt <- fread("Category,Price1,Price2,Price3,Price4,Price5,Price6
Product1,967,855,929,811,501,387
Product2,526,809,723,304,315,671
Product3,412,133,369,930,400,337
Product4,709,241,625,822,967,952
Product5,395,506,110,280,829,817
Product6,803,618,794,214,605,788")
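One can try an apply and tidyr::separate based solution. Note that, as written, it places the values below 500 in the new columns rather than their column positions; using which(x < 500) in place of x[x < 500] would give the positions instead: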
# First merge the data after moving values < 500 in left.
# The empty places should be filled with `-`
df$DesiredData <- apply(df[2:7],1,function(x){
value <- x[x<500]
paste0(c(value,rep("-",length(x)-length(value))),collapse = ",")
})
library(tidyverse)
# Now use `separate` function to split column in 6 desired columns
df %>% separate("DesiredData",
c("First","Second","Third","Fourth","Fifth","Sixth"), sep = ",")
# Category Price1 Price2 Price3 Price4 Price5 Price6 First Second Third Fourth Fifth Sixth
# 1 Product1 967 855 929 811 501 387 387 - - - - -
# 2 Product2 526 809 723 304 315 671 304 315 - - - -
# 3 Product3 412 133 369 930 400 337 412 133 369 400 337 -
# 4 Product4 709 241 625 822 967 952 241 - - - - -
# 5 Product5 395 506 110 280 829 817 395 110 280 - - -
# 6 Product6 803 618 794 214 605 788 214 - - - - -
Data:
df <- read.table(text="
Category,Price1,Price2,Price3,Price4,Price5,Price6
Product1,967,855,929,811,501,387
Product2,526,809,723,304,315,671
Product3,412,133,369,930,400,337
Product4,709,241,625,822,967,952
Product5,395,506,110,280,829,817
Product6,803,618,794,214,605,788",
header = TRUE, stringsAsFactors = FALSE, sep=",")
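If the column positions rather than the values are wanted (as in the question's expected output), a base R sketch along the same lines could use which(). It rebuilds df so that it runs on its own; pos, labels and out are illustrative names.

```r
df <- read.table(text = "Category,Price1,Price2,Price3,Price4,Price5,Price6
Product1,967,855,929,811,501,387
Product2,526,809,723,304,315,671
Product3,412,133,369,930,400,337
Product4,709,241,625,822,967,952
Product5,395,506,110,280,829,817
Product6,803,618,794,214,605,788",
                 header = TRUE, stringsAsFactors = FALSE, sep = ",")

labels <- c("First", "Second", "Third", "Fourth", "Fifth", "Sixth")

# For each row: positions (1 to 6) of the prices below 500,
# padded with "-" so that every row has six entries
pos <- t(apply(df[2:7], 1, function(x) {
  idx <- unname(which(x < 500))
  c(idx, rep("-", length(x) - length(idx)))
}))
colnames(pos) <- labels

# Attach the six new character columns to the original data
out <- data.frame(df, pos, stringsAsFactors = FALSE)
print(out)
```

Because "-" is mixed in with the positions, the new columns are character, not numeric.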

Adding values of two columns on the same row to get a new value

Sorry for asking a very basic question but I am new to R and really stuck on a rather simple matter; I have the data frame below (2 rows and 7 columns):
Sub sup_b hdt sup_2 lbnp sup_3 hut sup_4
6 175 434 596 585 601 593 211
7 130 592 592 593 600 384 166
These values correspond with time duration (secs) for seven test conditions
col$names <- c(sup_b, hdt, sup_2, lbnp, sup_3, hut, sup_4)
and 17 rows (each row is for one study subject- I have only included first two rows).
I am trying to add values from row 1 col$sup_b (175) and row 1 col$hdt (434) to get the combined duration for the first two conditions i.e. 609 secs. I then add the value of the previous two cols (609) to the next col$sup_2 to get the total duration (609 + 596) and so on until the last condition col$sup_4.
I have tried the method below which is for subject 6 (row 1), which works fine, but I want to tidy this up and make it easier as I have 17 subjects (rows) and have been advised there is an easier way around this:
sup_b <- 175
hdt <- (sup_b + 434)
sup_2 <- (hdt + 596)
lbnp <- (sup_2 + 585)
sup_3 <- (lbnp + 601)
hut <- (sup_3 + 593)
sup_4 <- (hut + 211)
I want to be able to just change the number of row and have the data pulled across from the data frame rather than entering each individual time period; for instance:
line <- 1 ### the row I want which corresponds to the subject
sup_b <- df[line, 2]
hdt <-df[line, 2] + df[line, 3]
but I keep getting this warning message:
In Ops.factor(df[line, 2], df[line, 3]) : ‘+’ not meaningful for factor
I have even tried colSums(df[,c(2:3)]), but get the following error:
Error in colSums(df[, c(2:3)]) : 'x' must be numeric.
also tried: st$sum <- apply(df[,c(2:3)], 1, sum), which doesn't work either.
df1[-1] <- t(apply(df1[-1],1,cumsum))
# Sub sup_b hdt sup_2 lbnp sup_3 hut sup_4
# 1 6 175 609 1205 1790 2391 2984 3195
# 2 7 130 722 1314 1907 2507 2891 3057
data
df1 <- read.table(text="Sub sup_b hdt sup_2 lbnp sup_3 hut sup_4
6 175 434 596 585 601 593 211
7 130 592 592 593 600 384 166", header = TRUE, stringsAsFactors = FALSE)
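The "'+' not meaningful for factor" message in the question suggests that the columns were read in as factors rather than numbers. Below is a hedged sketch of one way to handle that before taking the cumulative sum, using a made-up three-column extract of the data; the factor columns are an assumption about how the real data was read in.

```r
# Illustrative extract in which the duration columns arrived as factors
df <- data.frame(Sub   = c(6, 7),
                 sup_b = factor(c("175", "130")),
                 hdt   = factor(c("434", "592")),
                 sup_2 = factor(c("596", "592")))

# as.numeric() on a factor returns the internal level codes,
# so convert via character to recover the actual numbers
df[-1] <- lapply(df[-1], function(x) as.numeric(as.character(x)))

# Row-wise cumulative sum over the duration columns (everything except Sub)
df[-1] <- t(apply(df[-1], 1, cumsum))
print(df)
```

Reading the file with stringsAsFactors = FALSE, and checking the columns with str(), may avoid the conversion step altogether.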

R row summing to a new column

I'm new to R and this is probably a very simple question.
I just haven't been able to make rowsum/apply work.
My task is to add all the different expense categories in my dataframe and create a new variable with this value like this:
(not the original)
Food Dress Car
235 564 532
452 632 719
... ... ...
and then
Food Dress Car Total
235 564 532 1331
452 632 719 1803
... ... ... ...
I have tried:
rowsum, apply and aggregate and can't get it right
You can use addmargins after converting to matrix
addmargins(as.matrix(df1),2)
# Food Dress Car Sum
#[1,] 235 564 532 1331
#[2,] 452 632 719 1803
Or use rowSums
df1$Total <- rowSums(df1)
Or with Reduce
df1$Total <- Reduce(`+`, df1)
With apply functions:
cbind(dat, Total = apply(dat, 1, sum))
Food Dress Car Total
1 235 564 532 1331
2 452 632 719 1803
or with just rowSums:
cbind(dat, Total = rowSums(dat))
Food Dress Car Total
1 235 564 532 1331
2 452 632 719 1803
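A common reason these calls fail is a non-numeric column in the data frame; note also that rowsum() (sums by group) is a different function from rowSums(). Here is a small sketch that sums only the numeric columns, assuming a hypothetical extra Name column that is not in the question:

```r
# Example data with a hypothetical non-numeric column
df1 <- data.frame(Name  = c("a", "b"),
                  Food  = c(235, 452),
                  Dress = c(564, 632),
                  Car   = c(532, 719),
                  stringsAsFactors = FALSE)

# Flag the numeric columns and sum across those only
num_cols <- vapply(df1, is.numeric, logical(1))
df1$Total <- rowSums(df1[num_cols])
print(df1)
```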

Row wise operation on data.table

Let's say I'd like to calculate the magnitude of the range over a few columns, on a row-by-row basis.
set.seed(1)
dat <- data.frame(x=sample(1:1000,1000),
y=sample(1:1000,1000),
z=sample(1:1000,1000))
Using data.frame(), I would do something like this:
dat$diff_range <- apply(dat,1,function(x) diff(range(x)))
To put it more simply, I'm looking for this operation, over each row:
diff(range(dat[i,])) # for i in 1:nrow(dat)
If I were doing this for the entire table, it would be something like:
setDT(dat)[,diff_range := apply(dat,1,function(x) diff(range(x)))]
But how would I do it for only named (or numbered) rows?
pmax and pmin find the min and max across columns in a vectorized way, which is much better than splitting and working with each row separately. It's also pretty concise:
dat[, r := do.call(pmax,.SD) - do.call(pmin,.SD)]
x y z r
1: 266 531 872 606
2: 372 685 967 595
3: 572 383 866 483
4: 906 953 437 516
5: 201 118 192 83
---
996: 768 945 292 653
997: 61 231 965 904
998: 771 145 18 753
999: 841 148 839 693
1000: 857 252 218 639
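The same vectorized idea can be limited to chosen rows by subsetting first. The following base R sketch works on a plain data frame (rows, sub, r and r_apply are illustrative names) and compares the result with the apply() version from the question:

```r
set.seed(1)
dat <- data.frame(x = sample(1:1000, 1000),
                  y = sample(1:1000, 1000),
                  z = sample(1:1000, 1000))

rows <- c(1:4, 15:18)   # the rows of interest
sub  <- dat[rows, ]

# pmax/pmin take the columns as separate arguments and work element-wise,
# so each result entry is the max (or min) across one row
r <- do.call(pmax, sub) - do.call(pmin, sub)

# Slower row-by-row equivalent, for comparison
r_apply <- apply(sub, 1, function(v) diff(range(v)))

print(cbind(sub, r))
```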
How about this (where D is your data.table):
D[,list(I=.I,x,y,z)][,diff(range(x,y,z)),by=I][c(1:4,15:18)]
# I V1
#1: 1 971
#2: 2 877
#3: 3 988
#4: 4 241
#5: 15 622
#6: 16 684
#7: 17 971
#8: 18 835
#actually this will be faster
D[c(1:4,15:18),list(I=.I,x,y,z)][,diff(range(x,y,z)),by=I]
Use .I to create a row index to pass to the by= parameter; the function then runs on each row. The second call pre-filters by a list of row numbers; alternatively, you can add a key and filter on that if your real table looks different.
You can do it by subsetting before/during the function. If you only want every second row for example
dat_Diffs <- apply(dat[seq(2,1000,by=2),],1,function(x) diff(range(x)))
Or for rownames 1:10 (since their names weren't specified they are just numbers counting up)
dat_Diffs <- apply(dat[rownames(dat) %in% 1:10,],1,function(x) diff(range(x)))
But why not just calculate per row then subset later?

R remove rows containing a certain value

So I've got a CSV that I'm reading into an R data frame; it looks like this:
clientx,clienty,screenx,screeny
481,855,481,847
481,784,481,847
481,784,481,847
879,292,879,355
First line is of course the header. So we have 4 columns with numeric data in it, ranging from 1 to 4 digits. There are no negative numbers in the set except -1 which marks a missing value.
I want to remove every row that contains a -1 in any of the 4 columns.
Thanks in advance for the help
Your most efficient way will be to use the na.strings argument of read.csv() to code all -1 values as NA, then to drop incomplete cases.
Step 1: set na.strings=-1 in read.csv():
x <- read.csv(text="
clientx,clienty,screenx,screeny
481,855,481,847
481,784,481,847
481,784,481,847
-1,292,879,355", header=TRUE, na.strings=-1)
x
clientx clienty screenx screeny
1 481 855 481 847
2 481 784 481 847
3 481 784 481 847
4 NA 292 879 355
Step 2: Now use complete.cases or na.omit:
x[complete.cases(x), ]
clientx clienty screenx screeny
1 481 855 481 847
2 481 784 481 847
3 481 784 481 847
na.omit(x)
clientx clienty screenx screeny
1 481 855 481 847
2 481 784 481 847
3 481 784 481 847
The direct way:
df <- df[!apply(df, 1, function(x) {any(x == -1)}),]
UPDATE: this approach will fail if data.frame contains character columns because apply implicitly converts data.frame to matrix (which contains data of only one type) and character type has a priority over numeric types thus data.frame will be converted into character matrix.
Or replace -1 with NA and then use na.omit:
df[df==-1] <- NA
df <- na.omit(df)
These should work, I didn't check. Please always try to provide a reproducible example to illustrate your question.
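Regarding the caveat about character columns, one possible workaround is to test only the numeric columns. The sketch below uses a made-up data frame with a hypothetical label column; num_cols, has_missing and clean are illustrative names.

```r
# Example data: -1 marks a missing value; label is a hypothetical text column
df <- data.frame(clientx = c(481, 481, -1, 879),
                 clienty = c(855, -1, 784, 292),
                 label   = c("a", "b", "c", "d"),
                 stringsAsFactors = FALSE)

# Restrict the comparison to numeric columns so nothing is coerced to character
num_cols <- vapply(df, is.numeric, logical(1))

# Count the -1 entries per row; na.rm = TRUE treats existing NAs as "not -1"
has_missing <- rowSums(df[num_cols] == -1, na.rm = TRUE) > 0

clean <- df[!has_missing, ]
print(clean)
```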
