I have the following data:
library(data.table)
sales <- data.table(Customer = c(192,964,929,345,898,477,705,804,188,231,780,611,420,816,171,212,504,526,471,979,524,410,557,152,417,359,435,820,305,268,763,194,757,475,351,933,805,687,813,880,798,327,602,710,785,840,446,891,165,662),
Producttype = c(1,2,3,2,3,3,2,1,3,3,1,1,2,2,1,3,1,3,3,1,1,1,1,3,3,3,3,2,1,1,3,3,3,3,1,1,3,3,3,2,3,2,3,3,3,2,1,2,3,1),
Price = c(469,721,856,956,554,188,429,502,507,669,427,582,574,992,418,835,652,983,149,917,370,617,876,337,663,252,599,949,915,556,313,842,892,724,415,307,900,114,439,456,541,261,881,757,199,308,958,374,409,738),
Quarter = c(2,3,3,4,4,1,4,4,3,3,1,1,1,1,1,1,4,1,2,1,3,1,2,3,3,4,4,1,1,4,1,1,3,2,1,3,3,2,2,2,1,4,3,3,1,1,1,3,1,1))
How can I remove (let's say) the row in which Customer = 891?
And then I have another question:
If I want to manipulate the data I use data[row, column]. But when I want to use only the rows in which Quarter equals (for example) 4, I use data[Quarter == 4, ]. Why is it not data[, Quarter == 4], since Quarter is a column and not a row?
I did not find an appropriate answer on the internet which really explains the why.
Thank you.
You created your data with the data.table() function, so you can write:
sales[Customer != 891,]
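data[Quarter == 4, ] returns all columns for the rows where Quarter is equal to 4 (note the double ==, which tests for equality). The condition goes before the comma because it selects rows: although it is built from a column, Quarter == 4 evaluates to one TRUE/FALSE value per row. Anything after the comma would select columns instead.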
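A minimal sketch of both operations, using a small hypothetical table (toy, with made-up rows) rather than the full sales data. It assumes the data.table package is installed.

```r
library(data.table)

# Hypothetical three-row table with the same column names as `sales`
toy <- data.table(Customer = c(192, 891, 345),
                  Quarter  = c(2, 3, 4))

# The condition in the row position drops the matching row.
# The result is a new table, so assign it if you want to keep it.
toy_no_891 <- toy[Customer != 891]

# Rows where Quarter equals 4 (== compares, = would not)
toy_q4 <- toy[Quarter == 4]
```

Neither call modifies toy itself; only the assigned results reflect the filtering.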
When you use indexing, i.e., data[row, column], you are telling R which rows and which columns to return: whatever comes before the comma selects rows, and whatever comes after it selects columns.
Row: sales[sales$Customer %in% c(192,964),] translates to "search the column Customer of the data frame (or table) for rows whose value is 192 or 964, and isolate them". Note that data.table also allows sales[Customer %in% c(192, 964),], but data frames can't (use sales[sales$Customer %in% c(192,964),]).
   Customer Producttype Price Quarter
1:      192           1   469       2
2:      964           2   721       3
Columns: sales[, "Customer"] translates to "search the data frame (or table) for the column named Customer and isolate all its rows".
Customer
1: 192
2: 964
3: 929
4: 345
5: 898
...
Note this returns a data table with one column. If you use sales[,Customer] (data table) or sales$Customer (data frame), it will return a vector:
# [1] 192 964 929 345 898 477 705 804 188 231 780 611 420 816 171 212 504 526 471 979 524
# [22] 410 557 152 417 359 435 820 305 268 763 194 757 475 351 933 805 687 813 880 798 327
# [43] 602 710 785 840 446 891 165 662
You can of course combine the two: sales[sales$Quarter %in% 1:2, c("Customer", "Producttype")] isolates the values of Customer and Producttype for the rows in quarters 1 and 2:
   Customer Producttype
1:      192           1
2:      477           3
3:      780           1
4:      611           1
5:      420           2
...
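To make the "why" concrete, here is a small base R sketch with a hypothetical three-row data frame (df is made up for illustration). It shows that a condition on a column produces one logical value per row, which is what the row position expects.

```r
# Hypothetical data frame, not the original `sales` table
df <- data.frame(Customer = c(192, 964, 345),
                 Quarter  = c(2, 3, 4))

# A comparison on a column yields one TRUE/FALSE per row
keep <- df$Quarter == 4

# Used before the comma, the logical vector picks rows
rows <- df[keep, ]

# Used after the comma, a name picks a column
# (a plain data frame returns a vector here)
cols <- df[, "Customer"]
```

Because keep has as many elements as df has rows, it only makes sense as a row selector, even though it was computed from a column.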
I wonder how to make a function to subtract values present in column A01 from columns A02, A03 etc.
example data frame:
  A01 A02 A03 A04 A05 (...)
1 158 297 326 354 357
2 252 131 341 424 244
3 ...
4 ...
I can manually subtract each column for example:
sampledata[1]-sampledata[1]
sampledata[2]-sampledata[1]
sampledata[3]-sampledata[1]
sampledata[4]-sampledata[1] ... etc.
But how do I write a nice function to do this calculation for each column? As a result I expect to get this:
  A01  A02 A03 A04 A05 (...)
1   0  139 168 196 199
2   0 -121  89 172  -8
3 ...
4 ...
After the subtraction, if a value is negative, I want to convert it to zero.
I assume that my problem is easy to solve, but I'm a newbie in R.
Thank you all for different solutions.
It seems that the simplest solution that still works perfectly is the one suggested by @DavidArenburg:
new_sample_data = (sampledata - sampledata[,1]) * (sampledata > sampledata[,1])
It makes two transformations in one formula (subtracting the first column, and converting negatives to zeroes).
Thank you!
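For readers who want to see the two transformations separately, here is a sketch that splits the one-liner into steps, using the two sample rows from the question. The intermediate names diffs, mask and clipped are introduced here only for illustration.

```r
# The two sample rows from the question
sampledata <- data.frame(A01 = c(158, 252), A02 = c(297, 131),
                         A03 = c(326, 341), A04 = c(354, 424),
                         A05 = c(357, 244))

# Step 1: subtract the first column from every column
# (the vector is recycled down each column)
diffs <- sampledata - sampledata[, 1]

# Step 2: TRUE where a value exceeds the first column, FALSE otherwise
mask <- sampledata > sampledata[, 1]

# Multiplying by the mask zeroes every non-positive difference
new_sample_data <- diffs * mask

# An equivalent way to clip, for comparison: overwrite negatives directly
clipped <- diffs
clipped[clipped < 0] <- 0
```

Both new_sample_data and clipped hold the same values; the masked multiplication is just a more compact way to write the replacement.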
Here's how:
# Your data
A01 <- c(158, 252)
A02 <- c(297, 131)
A03 <- c(326, 341)
A04 <- c(354, 424)
A05 <- c(357, 244)
df <- data.frame(A01, A02, A03, A04, A05, stringsAsFactors = FALSE)
df
# Define the function
f_minus <- function(first_col, other_col) {
other_col - first_col
}
df_output <- as.data.frame(matrix(ncol=ncol(df), nrow=nrow(df)))
for (i in 1:ncol(df)) {
df_output[,c(i)] <- f_minus(df[,1], df[,i])
}
df_output
# V1 V2 V3 V4 V5
# 1 0 139 168 196 199
# 2 0 -121 89 172 -8
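The loop above still leaves negative values in the output and replaces the column names with V1 to V5. Assuming you also want the zero clipping asked for in the question and the original names kept, one possible sketch uses lapply with pmax:

```r
# Same data as in the answer above
df <- data.frame(A01 = c(158, 252), A02 = c(297, 131),
                 A03 = c(326, 341), A04 = c(354, 424),
                 A05 = c(357, 244))

# For each column: subtract the first column, then clamp at zero.
# lapply keeps the column names, so the result has A01..A05.
df_clipped <- as.data.frame(
  lapply(df, function(col) pmax(col - df[[1]], 0))
)
```

pmax(x, 0) compares element by element, so each negative difference is replaced by 0 and everything else is left unchanged.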
I have two message ids, say 197 and 198. I want to subset the data frame to those users who received messages with both ids; I only want the rows of users that have both message ids.
The data frame is m.
I have used the code
a = c(197,198)
n = subset(m$userid,m$mid %in% a)
I also tried
n = m[m$mid == 197 & m$mid == 198]
Both of these produce OR output, whereas I want AND output.
Here is the sample data frame:
mid userid opened
197 1022 Y
197 1036 N
197 1100 Y
198 1000 Y
198 1022 N
198 1036 Y
I want the output to be the records of the userids that have both mid 197 and mid 198:
mid userid opened
197 1022 Y
197 1036 N
198 1022 N
198 1036 Y
Using sqldf one solution could be achieved as:
# data
m <- read.table(text = "mid userid opened
197 1022 Y
197 1036 N
197 1100 Y
198 1000 Y
198 1022 N
198 1036 Y", header = T, stringsAsFactors = F)
library(sqldf)
result <- sqldf("SELECT * FROM m
WHERE userid in (SELECT userid FROM m WHERE mid == 197) AND
userid in (SELECT userid FROM m WHERE mid == 198)")
result
# mid userid opened
# 1 197 1022 Y
# 2 197 1036 N
# 3 198 1022 N
# 4 198 1036 Y
Using duplicated :
m[duplicated(m$userid) | duplicated(m$userid,fromLast = T), ]
# mid userid opened
# 1 197 1022 Y
# 2 197 1036 N
# 5 198 1022 N
# 6 198 1036 Y
With your real data you may first need m2 <- subset(m, mid %in% a), to make sure the table contains only the mid values from a before applying my solution.
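As a base R alternative that does not rely on every user appearing at most twice, here is a sketch that counts the distinct mid values per userid with ave. The names m2, n_mid and result are introduced here for illustration; the data is the sample from the question.

```r
# Sample data from the question
m <- data.frame(mid    = c(197, 197, 197, 198, 198, 198),
                userid = c(1022, 1036, 1100, 1000, 1022, 1036),
                opened = c("Y", "N", "Y", "Y", "N", "Y"))
a <- c(197, 198)

# A single row can never have both ids, so a row-wise AND matches nothing
n_both_in_one_row <- sum(m$mid == 197 & m$mid == 198)

# Keep only the mids of interest, then count distinct mids per user
m2    <- m[m$mid %in% a, ]
n_mid <- ave(m2$mid, m2$userid, FUN = function(x) length(unique(x)))

# Users whose count equals the number of requested ids have all of them
result <- m2[n_mid == length(unique(a)), ]
```

The "AND" here is a condition on the user, not on the row, which is why it has to be evaluated per userid group.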
For the sake of completeness, here are two data.table approaches. Both are capable of handling a of arbitrary length, i.e., with more than just 2 selected mid.
Join
library(data.table)
setDT(m)[m[mid %in% a][, uniqueN(mid), by = .(userid)][V1 == uniqueN(a)],
on = "userid"]
mid userid opened V1
1: 197 1022 Y 2
2: 198 1022 N 2
3: 197 1036 N 2
4: 198 1036 Y 2
The expression
m[mid %in% a][, uniqueN(mid), by = .(userid)][V1 == uniqueN(a)]
userid V1
1: 1022 2
2: 1036 2
filters m, then counts the number of unique mid by userid and returns those userid which have matches with all entries in a. (Instead of uniqueN(a), length(a) can be used, but the former is safer because it still works if a contains duplicates.)
Subsetting by row indices
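There is an alternative approach which returns row ids that are then used for subsetting m. Note that .I is computed on the filtered table m[mid %in% a], so the ids line up with the rows of m only when, as in the sample data, every row of m has a mid contained in a: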
m[mid %in% a][, .I[uniqueN(mid) == uniqueN(a)], by = .(userid)]
userid V1
1: 1022 1
2: 1022 5
3: 1036 2
4: 1036 6
m[m[mid %in% a][, .I[uniqueN(mid) == uniqueN(a)], by = .(userid)]$V1]
mid userid opened
1: 197 1022 Y
2: 198 1022 N
3: 197 1036 N
4: 198 1036 Y
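If m may contain other mid values, one way to keep the row ids aligned with m is to compute .I on the unfiltered table and put the filter inside the grouped expression. The following is only a sketch under that assumption, with a hypothetical table that includes an extra mid of 199; it requires the data.table package.

```r
library(data.table)

# Hypothetical data: row 1 has a mid (199) that is not in `a`
m <- data.table(mid    = c(199, 197, 197, 198, 198),
                userid = c(1022, 1022, 1036, 1000, 1022))
a <- c(197, 198)

# .I now refers to rows of m itself. Within each userid, keep the rows
# whose mid is in `a`, but only if the user has every id in `a`.
idx <- m[, .I[mid %in% a & uniqueN(mid[mid %in% a]) == uniqueN(a)],
         by = userid]$V1

result <- m[idx]
```

Here only userid 1022 has both ids, so idx points at its two matching rows and skips the row with mid 199.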
Let's say I'd like to calculate the magnitude of the range over a few columns, on a row-by-row basis.
set.seed(1)
dat <- data.frame(x=sample(1:1000,1000),
y=sample(1:1000,1000),
z=sample(1:1000,1000))
Using data.frame(), I would do something like this:
dat$diff_range <- apply(dat,1,function(x) diff(range(x)))
To put it more simply, I'm looking for this operation, over each row:
diff(range(dat[1,])) # for i in 1:nrow(dat)
If I were doing this for the entire table, it would be something like:
setDT(dat)[,diff_range := apply(dat,1,function(x) diff(range(x)))]
But how would I do it for only named (or numbered) rows?
pmax and pmin find the max and min across columns in a vectorized way, which is much better than splitting and working with each row separately. It's also pretty concise:
dat[, r := do.call(pmax,.SD) - do.call(pmin,.SD)]
x y z r
1: 266 531 872 606
2: 372 685 967 595
3: 572 383 866 483
4: 906 953 437 516
5: 201 118 192 83
---
996: 768 945 292 653
997: 61 231 965 904
998: 771 145 18 753
999: 841 148 839 693
1000: 857 252 218 639
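To restrict the same idea to selected rows, which is what the question asks for, the row numbers can go in the first argument and the columns in .SDcols. This is a sketch on a small hypothetical table (the values and the rows vector are made up for illustration); it requires the data.table package.

```r
library(data.table)

# Hypothetical four-row table
dat <- data.table(x = c(5, 1, 9, 4),
                  y = c(2, 8, 3, 4),
                  z = c(7, 6, 1, 4))

# Row numbers to update (an assumption for this example)
rows <- c(1, 3)

# Compute the range only for the chosen rows and the named columns;
# all other rows get NA in the new column r
dat[rows, r := do.call(pmax, .SD) - do.call(pmin, .SD),
    .SDcols = c("x", "y", "z")]
```

Naming the columns in .SDcols also avoids accidentally including a previously added result column in the calculation.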
How about this:
D[,list(I=.I,x,y,z)][,diff(range(x,y,z)),by=I][c(1:4,15:18)]
# I V1
#1: 1 971
#2: 2 877
#3: 3 988
#4: 4 241
#5: 15 622
#6: 16 684
#7: 17 971
#8: 18 835
#actually this will be faster
D[c(1:4,15:18),list(I=.I,x,y,z)][,diff(range(x,y,z)),by=I]
Use .I to create an index to pass to the by= parameter; then you can run the function on each row. The second call pre-filters by any list of row numbers, or you can add a key and filter on that if your real table looks different.
You can do it by subsetting before or during the function call. If you only want every second row, for example:
dat_Diffs <- apply(dat[seq(2,1000,by=2),],1,function(x) diff(range(x)))
Or for row names 1:10 (since the row names weren't specified, they are just numbers counting up):
dat_Diffs <- apply(dat[rownames(dat) %in% 1:10,],1,function(x) diff(range(x)))
But why not just calculate per row then subset later?
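A short base R sketch comparing the two orders, on a small hypothetical data frame (the values and the rows vector are made up for illustration). It computes the row ranges for everything and subsets afterwards, then subsets first, and shows the values agree.

```r
# Hypothetical four-row data frame
dat <- data.frame(x = c(5, 1, 9, 4),
                  y = c(2, 8, 3, 4),
                  z = c(7, 6, 1, 4))
rows <- c(1, 3)

# Option 1: calculate for every row, then pick the rows of interest
all_ranges   <- apply(dat, 1, function(r) diff(range(r)))
picked_after <- unname(all_ranges[rows])

# Option 2: subset first, then calculate only for those rows
picked_before <- unname(apply(dat[rows, ], 1, function(r) diff(range(r))))
```

Both give the same values; subsetting first simply avoids the work for rows you do not need, which may matter on large tables.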