Preparing discretized data for arules - r

I have a data set that has already been discretized, and I want to coerce it to transactions for use with the arules package.
CLUST_K <- structure(list(LONGITUDE = c(118.5, 118.5, 118.5, 118.5, 118.5,
118.5), LATITUDE = c(-11.5, -11.5, -11.5, -11.5, -11.5, -11.5
), DATE_START = structure(c(1419897600, 1419984000, 1420070400,
1420156800, 1420243200, 1420329600), class = c("POSIXct", "POSIXt"
)), DATE_END = structure(c(1420502400, 1420588800, 1420675200,
1420761600, 1420848000, 1420934400), class = c("POSIXct", "POSIXt"
)), FLAG = c(2, 1, 2, 2, 2, 2), SURFSKINTEMP = c(13L, 1L, 16L,
16L, 7L, 13L), SURFAIRTEMP = c(6L, 6L, 6L, 6L, 6L, 6L), TOTH2OVAP = c(5L,
17L, 17L, 17L, 17L, 17L), TOTO3 = c(16L, 16L, 16L, 10L, 7L, 7L
), TOTCO = c(12L, 12L, 8L, 4L, 12L, 12L), TOTCH4 = c(13L, 14L,
6L, 6L, 11L, 7L), OLR_ARIS = c(10L, 4L, 4L, 7L, 5L, 10L), CLROLR_ARIS = c(10L,
4L, 4L, 7L, 5L, 10L), OLR_NOAA = c(10L, 10L, 10L, 10L, 7L, 9L
), MODIS_LST = c(1L, 1L, 1L, 1L, 1L, 1L)), .Names = c("LONGITUDE",
"LATITUDE", "DATE_START", "DATE_END", "FLAG", "SURFSKINTEMP",
"SURFAIRTEMP", "TOTH2OVAP", "TOTO3", "TOTCO", "TOTCH4", "OLR_ARIS",
"CLROLR_ARIS", "OLR_NOAA", "MODIS_LST"), row.names = c(NA, 6L
), class = "data.frame")
Printing the data set CLUST_K gives:
LONGITUDE LATITUDE DATE_START DATE_END FLAG SURFSKINTEMP SURFAIRTEMP TOTH2OVAP TOTO3 TOTCO TOTCH4 OLR_ARIS CLROLR_ARIS OLR_NOAA MODIS_LST
1 118.5 -11.5 2014-12-30 2015-01-06 2 13 6 5 16 12 13 10 10 10 1
2 118.5 -11.5 2014-12-31 2015-01-07 1 1 6 17 16 12 14 4 4 10 1
3 118.5 -11.5 2015-01-01 2015-01-08 2 16 6 17 16 8 6 4 4 10 1
4 118.5 -11.5 2015-01-02 2015-01-09 2 16 6 17 10 4 6 7 7 10 1
5 118.5 -11.5 2015-01-03 2015-01-10 2 7 6 17 7 12 11 5 5 7 1
6 118.5 -11.5 2015-01-04 2015-01-11 2 13 6 17 7 12 7 10 10 9 1
Columns 1 to 5 of the data set hold the transaction information, and columns 6 to 15 hold the items, which have already been discretized.
When I try to coerce the data set to transactions:
CLUST_K_R <- CLUST_K[,6:15]
CLUST_K_R_T <- as(CLUST_K_R,"transactions")
Error in asMethod(object) :
column(s) 1, 2, 3, 4, 5, 6, 7, 8, 9, 10 not logical or a factor. Discretize the columns first.
But the data set has already been discretized.
When I use split, the result does not seem right either:
> s1 <- split(CLUST_K$SURFSKINTEMP, CLUST_K$SURFAIRTEMP,CLUST_K$TOTH2OVAP, CLUST_K$TOTO3)
> Tr <- as(s1,"transactions")
Warning message:
In asMethod(object) : removing duplicated items in transactions
> Tr
transactions in sparse format with
1 transactions (rows) and
4 items (columns)
Only 1 transaction is left, but there should be 6 transactions in my case.

Since you already discretized the data (via clustering), you only need to make sure that the data is encoded as nominal values (factor), not as numbers (integer).
for(i in 1:ncol(CLUST_K_R)) CLUST_K_R[[i]] <- as.factor(CLUST_K_R[[i]])
CLUST_K_R_T <- as(CLUST_K_R,"transactions")
summary(CLUST_K_R_T)
transactions as itemMatrix in sparse format with
6 rows (elements/itemsets/transactions) and
30 columns (items) and a density of 0.3333333
most frequent items:
SURFAIRTEMP=6 MODIS_LST=1 TOTH2OVAP=17 TOTCO=12 OLR_NOAA=10 (Other)
6 6 5 4 4 35
element (itemset/transaction) length distribution:
sizes
10
6
Min. 1st Qu. Median Mean 3rd Qu. Max.
10 10 10 10 10 10
includes extended item information - examples:
labels variables levels
1 SURFSKINTEMP=1 SURFSKINTEMP 1
2 SURFSKINTEMP=7 SURFSKINTEMP 7
3 SURFSKINTEMP=13 SURFSKINTEMP 13
includes extended transaction information - examples:
transactionID
1 1
2 2
3 3
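The same re-encoding can be written without an explicit loop. The following is a minimal sketch, assuming a small data frame named `codes` (a hypothetical name) that holds the first three discretized columns from the question. The final coercion needs arules, so it is guarded and only runs if the package is installed.

```r
# Stand-in for the discretized columns (integer cluster codes from the question).
codes <- data.frame(
  SURFSKINTEMP = c(13L, 1L, 16L, 16L, 7L, 13L),
  SURFAIRTEMP  = c(6L, 6L, 6L, 6L, 6L, 6L),
  TOTH2OVAP    = c(5L, 17L, 17L, 17L, 17L, 17L)
)

# Re-encode every column as a factor; `codes[] <-` keeps the data.frame shape.
codes[] <- lapply(codes, factor)

# Check that every column is now nominal, which is what the coercion expects.
all_factor <- all(vapply(codes, is.factor, logical(1)))

# The coercion itself requires arules, so it is only attempted when available.
if (requireNamespace("arules", quietly = TRUE)) {
  suppressPackageStartupMessages(library(arules))
  trans <- as(codes, "transactions")
}
```

Each integer code becomes a factor level, so an item such as SURFSKINTEMP=13 is treated as a label rather than a quantity.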

Related

Choose rows in which the absolute value of subtraction is less than a specified value

Let's say I have this dataframe:
ID X1 X2
1 1 2
2 2 1
3 3 1
4 4 1
5 5 5
6 6 20
7 7 20
8 9 20
9 10 20
dataset <- structure(list(ID = 1:9, X1 = c(1L, 2L, 3L, 4L, 5L, 6L, 7L, 9L,
10L), X2 = c(2L, 1L, 1L, 1L, 5L, 20L, 20L, 20L, 20L)),
class = "data.frame", row.names = c(NA,
-9L))
I want to select the rows in which the absolute value of X1 - X2 is greater than or equal to 2.
For example, row 4 value is 4-1, which is 3 and should be selected.
Row 9 value is 10-20, which is -10. Absolute value is 10 and should be selected.
In this case it would be rows 3, 4, 6, 7, 8 and 9
I tried:
dataset2 = dataset[,abs(dataset- c(dataset[,2])) > 2]
But I get an error.
The operation:
abs(dataset- c(dataset[,2])) > 2
does flag entries where the difference is more than 2, but the result is only meaningful for my second column and does not select the rows properly.
We can take the difference between the 'X1' and 'X2' columns and use it as a logical expression in subset to select the rows:
subset(dataset, abs(X1 - X2) >= 2)
# ID X1 X2
#3 3 3 1
#4 4 4 1
#6 6 6 20
#7 7 7 20
#8 8 9 20
#9 9 10 20
Or using column indices:
subset(dataset, abs(dataset[[2]] - dataset[[3]]) >= 2)
data
dataset <- structure(list(ID = 1:9, X1 = c(1L, 2L, 3L, 4L, 5L, 6L, 7L, 9L,
10L), X2 = c(2L, 1L, 1L, 1L, 5L, 20L, 20L, 20L, 20L)),
class = "data.frame", row.names = c(NA,
-9L))
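The same selection can also be sketched with plain logical indexing. This assumes the `dataset` from the question; the names `keep` and `selected` are illustrative. Note that the logical vector goes in the row position (before the comma), whereas the attempt in the question placed its condition in the column position.

```r
dataset <- data.frame(
  ID = 1:9,
  X1 = c(1L, 2L, 3L, 4L, 5L, 6L, 7L, 9L, 10L),
  X2 = c(2L, 1L, 1L, 1L, 5L, 20L, 20L, 20L, 20L)
)

# One TRUE/FALSE per row: is |X1 - X2| at least 2?
keep <- abs(dataset$X1 - dataset$X2) >= 2

# Logical vector in the row position selects the matching rows.
selected <- dataset[keep, ]
```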

How do I replace the fourth row of values in a dataframe with a corresponding vector in R?

I have a set of values
col1|col2|col3|col4
5 10 15 20
2 4 6 8
3 6 9 12
4 3 7 15
I would like to replace row 4 with a vector
c(4,8,12,16)
I would like to insert the vector in row 4 and replace the original values. I tried this script:
df[[4]]<- vector_name
I expect the result
col1|col2|col3|col4
5 10 15 20
2 4 6 8
3 6 9 12
4 8 12 16
We can use replace
replace(df1, cbind(nrow(df1), seq_along(df1)), v1)
data
df1 <- structure(list(col1 = c(5L, 2L, 3L, 4L), col2 = c(10L, 4L, 6L,
3L), col3 = c(15L, 6L, 9L, 7L), col4 = c(20L, 8L, 12L, 15L)),
class = "data.frame", row.names = c(NA,
-4L))
v1 <- c(4, 8, 12, 16)
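As an alternative sketch, the row can be assigned directly with row indexing, assuming `df1` and `v1` as defined in the data above. For comparison, `df[[4]] <- ...` addresses the fourth column, not the fourth row, which is why the attempt in the question did not give the expected result.

```r
df1 <- data.frame(
  col1 = c(5L, 2L, 3L, 4L),
  col2 = c(10L, 4L, 6L, 3L),
  col3 = c(15L, 6L, 9L, 7L),
  col4 = c(20L, 8L, 12L, 15L)
)
v1 <- c(4, 8, 12, 16)

# Row index before the comma, empty column index: one value goes to each column.
df1[4, ] <- v1
```

Because `v1` is a double vector, the integer columns are converted to numeric by this assignment.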

Finding columns that contain values based on another column

I have the following data frame:
Step 1 2 3
1 5 10 6
2 5 11 5
3 5 13 9
4 5 15 10
5 13 18 10
6 15 20 10
7 17 23 10
8 19 25 10
9 21 27 13
10 23 30 7
I would like to retrieve the columns that satisfy either of the following conditions: the value at step 1 equals the value at step 4, or the value at step 4 equals the value at step 8. In this case, columns 1 and 3 should be retrieved: column 1 because its value at step 1 equals its value at step 4 (i.e., 5), and column 3 because its value at step 4 equals its value at step 8 (i.e., 10).
I don't know how to do that in R. Can someone help me please?
You can get a logical index of the matching columns with the following code:
df[1, -1] == df[4, -1] | df[4, -1] == df[8, -1]
# X1 X2 X3
# 1 TRUE FALSE TRUE
# data
df <- structure(list(Step = 1:10, X1 = c(5L, 5L, 5L, 5L, 13L, 15L,
17L, 19L, 21L, 23L), X2 = c(10L, 11L, 13L, 15L, 18L, 20L, 23L,
25L, 27L, 30L), X3 = c(6L, 5L, 9L, 10L, 10L, 10L, 10L, 10L, 13L,
7L)), class = "data.frame", row.names = c(NA, -10L))
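To go from the logical result to the columns themselves, one possible sketch is to compare the rows as named vectors and then index by name. It assumes the `df` defined above; the names `r1`, `r4`, `r8`, `keep`, `hits` and `result` are illustrative.

```r
df <- data.frame(
  Step = 1:10,
  X1 = c(5L, 5L, 5L, 5L, 13L, 15L, 17L, 19L, 21L, 23L),
  X2 = c(10L, 11L, 13L, 15L, 18L, 20L, 23L, 25L, 27L, 30L),
  X3 = c(6L, 5L, 9L, 10L, 10L, 10L, 10L, 10L, 13L, 7L)
)

# Pull rows 1, 4 and 8 (without the Step column) as named integer vectors.
r1 <- unlist(df[1, -1])
r4 <- unlist(df[4, -1])
r8 <- unlist(df[8, -1])

# TRUE where step 1 equals step 4, or step 4 equals step 8.
keep <- r1 == r4 | r4 == r8

# Names of the matching columns, and the data restricted to them.
hits <- names(keep)[keep]
result <- df[, c("Step", hits)]
```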

How to change type of scientific number factor column into numeric in a data frame using dplyr

I have the following data frame:
library(tidyverse)
df <- structure(list(rank = structure(c(1L, 10L, 11L, 12L, 13L, 14L,
15L, 16L, 17L, 2L, 3L, 4L, 5L, 6L, 7L, 8L, 9L), .Label = c("1",
"10", "11", "12", "13", "14", "15", "16", "17\n*", "2", "3",
"4", "5", "6", "7", "8", "9"), class = "factor"), p_value = structure(c(2L,
5L, 17L, 16L, 13L, 12L, 11L, 10L, 9L, 8L, 4L, 3L, 14L, 7L, 6L,
1L, 15L), .Label = c("1e-12", "1e-12262", "1e-164", "1e-176",
"1e-2381", "1e-26", "1e-27", "1e-274", "1e-369", "1e-397", "1e-413",
"1e-422", "1e-429", "1e-57", "1e-6", "1e-855", "1e-919"), class = "factor")), row.names = c(NA,
-17L), class = c("tbl_df", "tbl", "data.frame"), .Names = c("rank",
"p_value"))
The df looks like this:
# A tibble: 17 x 2
rank p_value
<fctr> <fctr>
1 1 1e-12262
2 2 1e-2381
3 3 1e-919
4 4 1e-855
5 5 1e-429
6 6 1e-422
7 7 1e-413
8 8 1e-397
9 9 1e-369
10 10 1e-274
11 11 1e-176
12 12 1e-164
13 13 1e-57
14 14 1e-27
15 15 1e-26
16 16 1e-12
17 "17\n*" 1e-6
My question is how to convert the p_value column from fctr to numeric so that I can perform math operations on it.
I tried this, but got an error:
> df %>% mutate(logp = log(p_value))
Error in mutate_impl(.data, dots) :
Evaluation error: ‘log’ not meaningful for factors.
You can convert these to numbers like this. You first need to convert the factor to character before converting to numeric; otherwise you just get the integer codes of the factor levels.
df %>% mutate(logp = log(as.numeric(as.character(p_value))))
# A tibble: 17 x 3
rank p_value logp
<fctr> <fctr> <dbl>
1 1 1e-12262 -Inf
2 2 1e-2381 -Inf
3 3 1e-919 -Inf
4 4 1e-855 -Inf
5 5 1e-429 -Inf
6 6 1e-422 -Inf
7 7 1e-413 -Inf
8 8 1e-397 -Inf
9 9 1e-369 -Inf
10 10 1e-274 -630.90832
11 11 1e-176 -405.25498
12 12 1e-164 -377.62396
13 13 1e-57 -131.24735
14 14 1e-27 -62.16980
15 15 1e-26 -59.86721
16 16 1e-12 -27.63102
17 "17\n*" 1e-6 -13.81551
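The -Inf entries appear because values that small underflow to 0 when stored as doubles, so their logarithm is -Inf. If finite log values are needed, one possible workaround (a sketch, not part of the original answer) is to compute the log from the text form, assuming every value is written as mantissa, "e", exponent. The names `p_value`, `mant`, `expo` and `logp` are illustrative.

```r
# A few of the p-values from the question, stored as a factor.
p_value <- factor(c("1e-12262", "1e-2381", "1e-274", "1e-6"))

# Pitfall: as.numeric() on a factor returns the level codes, not the values.
codes <- as.numeric(p_value)

# Split each string at "e" into mantissa and exponent.
parts <- strsplit(as.character(p_value), "e", fixed = TRUE)
mant <- as.numeric(vapply(parts, `[`, character(1), 1))
expo <- as.numeric(vapply(parts, `[`, character(1), 2))

# log(m * 10^e) = log(m) + e * log(10), which never forms the tiny number itself.
logp <- log(mant) + expo * log(10)
```

For values that are representable as doubles, such as 1e-6, this agrees with `log(as.numeric(as.character(p_value)))`.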

Handling ties in finding index of nth maximum value in R

I am working on a dataframe and trying to find the index of the nth maximum value (n varies in a loop). However, the columns contain tied values, and the program throws an error. Below is a sample dataset. I am basically trying to generate a similar dataframe that holds, for each column, the index values of that column's elements.
In the output DF, column 1 will hold the index values of the elements of Refer_1, so Output_DF[1,1] will hold the index of the highest value, while Output_DF[10,1] will hold the index of the lowest value. Below is the input DF.
Input
1 17
2 21
3 13
4 26
5 204
6 36
7 14
8 25
9 45
10 37
Output (index values)
5
9
10
6
4
8
2
1
7
3
I am currently using which, unlist and partial together to get the indexes, however, I am unable to rectify the error. Note that the ties can occur with any nth maximum value (not necessarily the column maxima).
which(Consolidated_data_new[,i]==unlist(sort(Consolidated_data_new[,i],partial=j)[j]))
Please note that I want the code to return only one value at a time, and handle the 2nd tied value in the next loop iteration.
Please help solve this.
Regards,
library(data.table)
DT <- data.table(Refer_1 = c(11L, 15L, 7L, 19L, 104L, 24L, 11L,
22L, 39L, 19L), Refer_2 = c(17L, 21L, 13L, 25L, 204L, 36L, 14L,
25L, 45L, 37L))
DT[,lapply(.SD, order,decreasing=TRUE)]
Refer_1 Refer_2
1: 5 5
2: 9 9
3: 6 10
4: 8 6
5: 4 4
6: 10 8
7: 2 2
8: 1 1
9: 7 7
10: 3 3
Your comments suggest you are working with a dataframe that has more than one column and that you want an output dataframe that has the results of order with decreasing=TRUE applied to every column:
> DF[2] <- sample(1:300, 10)
> DF[3] <- sample(1:300, 10)
> DF
Input V2 V3
1 17 210 3
2 21 72 4
3 13 263 1
4 26 249 6
5 204 223 10
6 36 83 7
7 14 107 2
8 25 295 5
9 45 198 9
10 37 112 8
> ordDF <- as.data.frame(lapply(DF, order, decreasing=TRUE))
> names(ordDF) <- paste0("res", 1:length(DF) )
> ordDF
res1 res2 res3
1 5 8 4
2 9 3 9
3 10 4 2
4 6 5 7
5 4 1 10
6 8 9 8
7 2 10 1
8 1 7 6
9 7 6 3
10 3 2 5
> dput(ordDF)
structure(list(res1 = c(5L, 9L, 10L, 6L, 4L, 8L, 2L, 1L, 7L,
3L), res2 = c(8L, 3L, 4L, 5L, 1L, 9L, 10L, 7L, 6L, 2L), res3 = c(4L,
9L, 2L, 7L, 10L, 8L, 1L, 6L, 3L, 5L)), .Names = c("res1", "res2",
"res3"), row.names = c(NA, -10L), class = "data.frame")
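The tie handling asked about in the question can be sketched in base R using the Refer_1 values from the data.table answer above. Computing the ordering once and reading its jth element returns exactly one index per iteration; tied values keep their original relative order, so the second tied element is returned on the next iteration. The names `x` and `ord` are illustrative.

```r
x <- c(11L, 15L, 7L, 19L, 104L, 24L, 11L, 22L, 39L, 19L)

# Indices of x from largest to smallest; ties stay in original order.
ord <- order(x, decreasing = TRUE)

# Index of the jth maximum, one value per loop iteration.
for (j in seq_along(x)) {
  idx <- ord[j]
  # x[idx] is the jth largest value; use idx as needed here.
}
```

Here the value 19 occurs at positions 4 and 10, so `ord[5]` is 4 and `ord[6]` is 10.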
