Related
I have a table with two columns A and B. I want to create a new table with two new columns added: X and Y.
The X column is to contain the values from the A column, but with the division performed. Values from the first row (from column A) divided by the values from the second row in column A and so for all subsequent rows, e.g. the third row divided by the fourth row etc.
The Y column is to contain the values from the B column, but with the division performed. Values from the first row (from column B) divided by the values from the second row in column B and so for all subsequent rows, e.g. the third row divided by the fourth row etc.
So far I used Excel for this. But now I need it in R if possible in the form of a function so that I can reuse this code easily. I haven't done this in R yet, so I am asking for help.
Example data:
structure(list(A = c(2L, 7L, 5L, 11L, 54L, 12L, 34L, 14L, 10L,
6L), B = c(3L, 5L, 1L, 21L, 67L, 32L, 19L, 24L, 44L, 37L)), class = "data.frame", row.names = c(NA,
-10L))
Sample results:
structure(list(A = c(2L, 7L, 5L, 11L, 54L, 12L, 34L, 14L, 10L,
6L), B = c(3L, 5L, 1L, 21L, 67L, 32L, 19L, 24L, 44L, 37L), X = c("",
"0,285714286", "", "0,454545455", "", "4,5", "", "2,428571429",
"", "1,666666667"), Y = c("", "0,6", "", "0,047619048", "", "2,09375",
"", "0,791666667", "", "1,189189189")), class = "data.frame", row.names = c(NA,
-10L))
You could use dplyr's across and lag (combined with modulo for picking every second row):
library(dplyr)
df |> mutate(across(c(A, B), ~ ifelse(row_number() %% 2 == 0, lag(.) / ., NA), .names = "new_{.col}"))
If you want a character vector change NA to "".
Output:
A B new_A new_B
1 2 3 NA NA
2 7 5 0.2857143 0.60000000
3 5 1 NA NA
4 11 21 0.4545455 0.04761905
5 54 67 NA NA
6 12 32 4.5000000 2.09375000
7 34 19 NA NA
8 14 24 2.4285714 0.79166667
9 10 44 NA NA
10 6 37 1.6666667 1.18918919
Function:
ab_fun <- function(data, vars) {
data |>
mutate(across(c(A, B), ~ ifelse(row_number() %% 2 == 0, lag(.) / ., NA), .names = "new_{.col}"))
}
ab_fun(df, c(A,B))
Updated with new data and correct code. + Function
I have a table with two columns A and B. I want to create a new table with two new columns added: X and Y. These two new columns are to contain data from column A, but every second row from column A. Correspondingly for column X, starting from the first value in column A and from the second value in column A for column Y.
So far, I have been doing it in Excel. But now I need it in R best function form so that I can easily reuse that code. I haven't done this in R yet, so I am asking for help.
Example data:
structure(list(A = c(2L, 7L, 5L, 11L, 54L, 12L, 34L, 14L, 10L,
6L), B = c(3L, 5L, 1L, 21L, 67L, 32L, 19L, 24L, 44L, 37L)), class = "data.frame", row.names = c(NA,
-10L))
Sample result:
structure(list(A = c(2L, 7L, 5L, 11L, 54L, 12L, 34L, 14L, 10L,
6L), B = c(3L, 5L, 1L, 21L, 67L, 32L, 19L, 24L, 44L, 37L), X = c(2L,
NA, 5L, NA, 54L, NA, 34L, NA, 10L, NA), Y = c(NA, 7L, NA, 11L,
NA, 12L, NA, 14L, NA, 6L)), class = "data.frame", row.names = c(NA,
-10L))
It is not a super elegant solution, but it works:
exampleDF <- structure(list(A = c(2L, 7L, 5L, 11L, 54L,
12L, 34L, 14L, 10L, 6L),
B = c(3L, 5L, 1L, 21L, 67L,
32L, 19L, 24L, 44L, 37L)),
class = "data.frame", row.names = c(NA, -10L))
index <- seq(from = 1, to = nrow(exampleDF), by = 2)
exampleDF$X <- NA
exampleDF$X[index] <- exampleDF$A[index]
exampleDF$Y <- exampleDF$A
exampleDF$Y[index] <- NA
You could also make use of the row numbers and the modulo operator:
A simple ifelse way:
library(dplyr)
df |>
mutate(X = ifelse(row_number() %% 2 == 1, A, NA),
Y = ifelse(row_number() %% 2 == 0, A, NA))
Or using pivoting:
library(dplyr)
library(tidyr)
df |>
mutate(name = ifelse(row_number() %% 2 == 1, "X", "Y"),
value = A) |>
pivot_wider()
A function using the first approach could look like:
See comment
xy_fun <- function(data, A = A, X = X, Y = Y) {
data |>
mutate({{X}} := ifelse(row_number() %% 2 == 1, {{A}}, NA),
{{Y}} := ifelse(row_number() %% 2 == 0, {{A}}, NA))
}
xy_fun(df, # Your data
A, # The col to take values from
X, # The column name of the first new column
Y # The column name of the second new column
)
Output:
A B X Y
1 2 3 2 NA
2 7 5 NA 7
3 5 1 5 NA
4 11 21 NA 11
5 54 67 54 NA
6 12 32 NA 12
7 34 19 34 NA
8 14 24 NA 14
9 10 44 10 NA
10 6 37 NA 6
Data stored as df:
df <- structure(list(A = c(2L, 7L, 5L, 11L, 54L, 12L, 34L, 14L, 10L, 6L),
B = c(3L, 5L, 1L, 21L, 67L, 32L, 19L, 24L, 44L, 37L)
),
class = "data.frame",
row.names = c(NA, -10L)
)
I like the #harre approach:
Another approach with base R we could ->
Use R's recycling ability (of a shorter-vector to a longer-vector):
df$X <- df$A
df$Y <- df$B
df$X[c(FALSE, TRUE)] <- NA
df$Y[c(TRUE, FALSE)] <- NA
df
A B X Y
1 2 3 2 NA
2 7 5 NA 5
3 5 1 5 NA
4 11 21 NA 21
5 54 67 54 NA
6 12 32 NA 32
7 34 19 34 NA
8 14 24 NA 24
9 10 44 10 NA
10 6 37 NA 37
Im trying to produce a correlation plot for my data but i get 'x must be numeric error', other fixes have not worked for my case. Do i have to change the month to numeric as well? or is there a way of selecting only the numeric columns for my plot
Tried converting all to numeric but it just changes back to factor automatically
getwd()
myDF <- read.csv("qbase.csv")
head(myDF)
str(myDF)
cp <-cor(myDF)
head(round(cp,2))
'data.frame': 12 obs. of 8 variables:
$ Month : Factor w/ 12 levels "18-Apr","18-Aug",..: 5 4 8 1 9 7 6 2 12 11 ...
$ Monthly.Recurring.Revenue: Factor w/ 2 levels "$25,000 ","$40,000 ": 1 1 1 1 1 2 2 2 2 2 ...
$ Price.per.Seat : Factor w/ 2 levels "$40 ","$50 ": 2 2 2 2 2 1 1 1 1 1 ...
$ Paid.Seats : int 500 500 500 500 500 1000 1000 1000 1000 1000 ...
$ Active.Users : int 10 50 50 100 450 550 800 900 950 800 ...
$ Support.Cases : int 0 0 1 5 35 155 100 75 50 45 ...
$ Users.Trained : int 1 5 0 50 100 300 50 30 0 100 ...
$ Features.Used : int 5 5 5 5 8 9 9 10 15 15 ...
The results to dput(myDF) as are follows:
dput( myDF)
structure(list(Month = structure(c(5L, 4L, 8L, 1L, 9L, 7L, 6L,
2L, 12L, 11L, 10L, 3L), .Label = c("18-Apr", "18-Aug", "18-Dec",
"18-Feb", "18-Jan", "18-Jul", "18-Jun", "18-Mar", "18-May", "18-Nov",
"18-Oct", "18-Sep"), class = "factor"), Monthly.Recurring.Revenue = structure(c(1L,
1L, 1L, 1L, 1L, 2L, 2L, 2L, 2L, 2L, 2L, 2L), .Label = c("$25,000 ",
"$40,000 "), class = "factor"), Price.per.Seat = structure(c(2L,
2L, 2L, 2L, 2L, 1L, 1L, 1L, 1L, 1L, 1L, 1L), .Label = c("$40 ",
"$50 "), class = "factor"), Paid.Seats = c(500L, 500L, 500L,
500L, 500L, 1000L, 1000L, 1000L, 1000L, 1000L, 1000L, 1000L),
Active.Users = c(10L, 50L, 50L, 100L, 450L, 550L, 800L, 900L,
950L, 800L, 700L, 600L), Support.Cases = c(0L, 0L, 1L, 5L,
35L, 155L, 100L, 75L, 50L, 45L, 10L, 5L), Users.Trained = c(1L,
5L, 0L, 50L, 100L, 300L, 50L, 30L, 0L, 100L, 50L, 0L), Features.Used = c(5L,
5L, 5L, 5L, 8L, 9L, 9L, 10L, 15L, 15L, 15L, 15L)), class = "data.frame", row.names = c(NA,
-12L))
You can convert dates to POSIXct and also remove the dollar sign to convert the second and third columns to numeric:
myDF$Month <- as.numeric(as.POSIXct(myDF$Month, format="%d-%b", tz="GMT"))
myDF[,c(2,3)] <- sapply(myDF[,c(2,3)], function(x) as.numeric(gsub("[\\$,]", "", x)))
cp <-cor(myDF)
library(ggcorrplot)
ggcorrplot(cp)
You are trying to get a correlation between factors and numeric columns, wich can't happen (cor handles only numeric, hence the error). You can do:
library(data.table)
ir <- data.table(iris) # since you didn't produce a reproducible example
ir[, cor(.SD), .SDcols = names(ir)[(lapply(ir, class) == "numeric")]]
what is in there:
cor(.SD) will calculate the correlation matrix for a new dataframe composed of a subset data.table (.SD, see ?data.table).
.SDcols establish wich columns will go into that subset data.table. They are only those which class is numeric.
You can remove the dollar sign and change the integer variables to numeric using sapply, then calculate the correlation.
myDF[,c(2,3)] <- sapply(myDF[,c(2,3)], function(x) as.numeric(gsub("[\\$,]", "", x)))
newdf <- sapply(myDF[,2:8],as.numeric)
cor(newdf)
Edited:
If you want to use the month variable. Please install lubridate and use month function.
For example:
library(lubridate)
myDF$Month<- month(as.POSIXct(myDF$Month, format="%d-%b", tz="GMT"))
myDF[,c(2,3)] <- sapply(myDF[,c(2,3)], function(x) as.numeric(gsub("[\\$,]", "", x)))
newdf <- sapply(myDF,as.numeric)
cor(as.data.frame(newdf))
The way to convert those months to Date class:
myDF$MonDt <- as.Date( paste0(myDF$Month, "-15"), format="%y-%b-%d")
Could also have used zoo::as.yearmon. Either method would allow you to apply as.numeric to get a valid time scaled value. The other answers are adequate when using single year data but because they incorrectly make the assumption the the leading two digits are day of the month rather than the year, they are going to fail to deliver valid answers in any multi-year dataset, but will not throw any warning about this.
with(myDF, cor(Active.Users, as.numeric(MonDt) ) )
[1] 0.8269705
As one of the other answers illustrated removing the $ and commas is needed before as.numeric will succeed on currency-formatted text. Again, this is also factor data so as.numeric could have yielded erroneous answers, although in this simple example it would not. A safe method would be:
myDF[2:3] <- lapply(myDF[2:3], function(x) as.numeric( gsub("[$,]", "", x)))
myDF
Month Monthly.Recurring.Revenue Price.per.Seat Paid.Seats Active.Users
1 18-Jan 25000 50 500 10
2 18-Feb 25000 50 500 50
3 18-Mar 25000 50 500 50
4 18-Apr 25000 50 500 100
5 18-May 25000 50 500 450
6 18-Jun 40000 40 1000 550
7 18-Jul 40000 40 1000 800
8 18-Aug 40000 40 1000 900
9 18-Sep 40000 40 1000 950
10 18-Oct 40000 40 1000 800
11 18-Nov 40000 40 1000 700
12 18-Dec 40000 40 1000 600
Support.Cases Users.Trained Features.Used MonDt
1 0 1 5 2018-01-15
2 0 5 5 2018-02-15
3 1 0 5 2018-03-15
4 5 50 5 2018-04-15
5 35 100 8 2018-05-15
6 155 300 9 2018-06-15
7 100 50 9 2018-07-15
8 75 30 10 2018-08-15
9 50 0 15 2018-09-15
10 45 100 15 2018-10-15
11 10 50 15 2018-11-15
12 5 0 15 2018-12-15
This question gets an answer that allows multiple correlation coefficients to be calculated and the two way data associations plotted on one page:
How to add p values for correlation coefficients plotted using splom in lattice?
I have a data set which is applied to discretization proceeding, and I want to coerce the data set to transactions for using arules package.
CLUST_K <- structure(list(LONGITUDE = c(118.5, 118.5, 118.5, 118.5, 118.5,
118.5), LATITUDE = c(-11.5, -11.5, -11.5, -11.5, -11.5, -11.5
), DATE_START = structure(c(1419897600, 1419984000, 1420070400,
1420156800, 1420243200, 1420329600), class = c("POSIXct", "POSIXt"
)), DATE_END = structure(c(1420502400, 1420588800, 1420675200,
1420761600, 1420848000, 1420934400), class = c("POSIXct", "POSIXt"
)), FLAG = c(2, 1, 2, 2, 2, 2), SURFSKINTEMP = c(13L, 1L, 16L,
16L, 7L, 13L), SURFAIRTEMP = c(6L, 6L, 6L, 6L, 6L, 6L), TOTH2OVAP = c(5L,
17L, 17L, 17L, 17L, 17L), TOTO3 = c(16L, 16L, 16L, 10L, 7L, 7L
), TOTCO = c(12L, 12L, 8L, 4L, 12L, 12L), TOTCH4 = c(13L, 14L,
6L, 6L, 11L, 7L), OLR_ARIS = c(10L, 4L, 4L, 7L, 5L, 10L), CLROLR_ARIS = c(10L,
4L, 4L, 7L, 5L, 10L), OLR_NOAA = c(10L, 10L, 10L, 10L, 7L, 9L
), MODIS_LST = c(1L, 1L, 1L, 1L, 1L, 1L)), .Names = c("LONGITUDE",
"LATITUDE", "DATE_START", "DATE_END", "FLAG", "SURFSKINTEMP",
"SURFAIRTEMP", "TOTH2OVAP", "TOTO3", "TOTCO", "TOTCH4", "OLR_ARIS",
"CLROLR_ARIS", "OLR_NOAA", "MODIS_LST"), row.names = c(NA, 6L
), class = "data.frame")
from the data set CLUST_K, you can see that
LONGITUDE LATITUDE DATE_START DATE_END FLAG SURFSKINTEMP SURFAIRTEMP TOTH2OVAP TOTO3 TOTCO TOTCH4 OLR_ARIS CLROLR_ARIS OLR_NOAA MODIS_LST
1 118.5 -11.5 2014-12-30 2015-01-06 2 13 6 5 16 12 13 10 10 10 1
2 118.5 -11.5 2014-12-31 2015-01-07 1 1 6 17 16 12 14 4 4 10 1
3 118.5 -11.5 2015-01-01 2015-01-08 2 16 6 17 16 8 6 4 4 10 1
4 118.5 -11.5 2015-01-02 2015-01-09 2 16 6 17 10 4 6 7 7 10 1
5 118.5 -11.5 2015-01-03 2015-01-10 2 7 6 17 7 12 11 5 5 7 1
6 118.5 -11.5 2015-01-04 2015-01-11 2 13 6 17 7 12 7 10 10 9 1
first column to fifth column of the data set is the transaction information, and column 6 to column 15 are the transactions, and which are applied to discretization proceeding.
when I try to coerce the data set to transactions
CLUST_K_R <- CLUST_K[,6:15]
CLUST_K_R_T <- as(CLUST_K_R,"transactions")
Error in asMethod(object) :
column(s) 1, 2, 3, 4, 5, 6, 7, 8, 9, 10 not logical or a factor. Discretize the columns first.
but I the data set has already applied to discretization proceeding
When I use split, it also seems not right
> s1 <- split(CLUST_K$SURFSKINTEMP, CLUST_K$SURFAIRTEMP,CLUST_K$TOTH2OVAP, CLUST_K$TOTO3)
> Tr <- as(s1,"transactions")
Warning message:
In asMethod(object) : removing duplicated items in transactions
> Tr
transactions in sparse format with
1 transactions (rows) and
4 items (columns)
only 1 transactions left, but it should be 6 transactions in my case.
Since you already discretized the data (via clustering), you only need to make sure that the data is encoded as nominal values (factor) not numbers (integer).
for(i in 1:ncol(CLUST_K_R)) CLUST_K_R[[i]] <- as.factor(CLUST_K_R[[i]])
CLUST_K_R_T <- as(CLUST_K_R,"transactions")
summary(CLUST_K_R_T)
transactions as itemMatrix in sparse format with
6 rows (elements/itemsets/transactions) and
30 columns (items) and a density of 0.3333333
most frequent items:
SURFAIRTEMP=6 MODIS_LST=1 TOTH2OVAP=17 TOTCO=12 OLR_NOAA=10 (Other)
6 6 5 4 4 35
element (itemset/transaction) length distribution:
sizes
10
6
Min. 1st Qu. Median Mean 3rd Qu. Max.
10 10 10 10 10 10
includes extended item information - examples:
labels variables levels
1 SURFSKINTEMP=1 SURFSKINTEMP 1
2 SURFSKINTEMP=7 SURFSKINTEMP 7
3 SURFSKINTEMP=13 SURFSKINTEMP 13
includes extended transaction information - examples:
transactionID
1 1
2 2
3 3
I'm new in R and to be honest don't know how to call what I'm looking for :)
I have data-set "ds" set with 2 columns:
D | res
==========
Ds 20
Dx 23
Dp 1
Ds 12
Ds 23
Ds 54
Dn 65
Ds 122
Dx 11
Dx 154
Dx 18
Do 4
Df 17
Dp 5
Dp 107
Dp 8
Df 3
Dp 33
Dd 223
Dc 7
Dv 22
Du 34
Dh 22
Ds 12
Dy 78
Dd 128
I need to calculate top 4 from column "D" by "Res" so desired result would look like :
D | Res
========
Dd 351
Dp 154
Ds 243
Dx 206
and by %age:
D | % Of Total
==========
Dd 29.10%
Dp 12.77%
Ds 20.15%
Dx 17.08%
Thanks
We can use aggregate() to obtain the sum of each type of "D", and we can introduce a new column to account for the edit of the OP and include also the percentage.
In order to display the result in the desired form, we can apply the order() function to rearrange the rows according to the value of Res. The function rev() in this case ensures that the highest value is put on top, and head() with the parameter 4 displays the first four rows.
summarized <- aggregate(Res ~. , df1, sum)
summarized$Perc <- with(summarized, paste0(round(Res/sum(Res)*100,2),"%"))
head(summarized[rev(order(summarized$Res)),],4)
D Res Perc
2 Dd 351 29.1%
8 Ds 243 20.15%
11 Dx 206 17.08%
7 Dp 154 12.77%
data
df1 <- structure(list(D = structure(c(8L, 11L, 7L, 8L, 8L, 8L, 5L,
8L, 11L, 11L, 11L, 6L, 3L, 7L, 7L, 7L, 3L, 7L, 2L, 1L, 10L, 9L,
4L, 8L, 12L, 2L), .Label = c("Dc", "Dd", "Df", "Dh", "Dn", "Do",
"Dp", "Ds", "Du", "Dv", "Dx", "Dy"), class = "factor"), Res = c(20L,
23L, 1L, 12L, 23L, 54L, 65L, 122L, 11L, 154L, 18L, 4L, 17L, 5L,
107L, 8L, 3L, 33L, 223L, 7L, 22L, 34L, 22L, 12L, 78L, 128L)),
.Names = c("D", "Res"), class = "data.frame", row.names = c(NA, -26L))
If you mean to sum Res per D and then select the top 4 sums (assuming you made mistakes calculating the sums for ds and dp) you could try:
library(dplyr)
df1 %>% mutate(per = Res/sum(Res)) %>% group_by(D) %>% summarise(Res = sum(Res), perc = sum(per)) %>% top_n(4, Res)
Source: local data frame [4 x 3]
D Res perc
(fctr) (int) (dbl)
1 Dd 351 0.2910448
2 Dp 154 0.1276949
3 Ds 243 0.2014925
4 Dx 206 0.1708126
Option using data.table
library(data.table)
out = setorder(setDT(data)[, .(tmp = sum(res)), by = D]
[, .(D, ptg = (tmp/sum(tmp))*100)], -ptg)[1:4,]
#> out
# D ptg
#1: Dd 29.10448
#2: Ds 20.14925
#3: Dx 17.08126
#4: Dp 12.76949