How can create my own factor column in a dataframe?

How can create my own factor column in a dataframe? - r

I have dataframe and task:"Define your own criterion of income level, and split data according to levels of this criterion"
dput(head(creditcard))
structure(list(card = structure(c(2L, 2L, 2L, 2L, 2L, 2L), levels = c("no",
"yes"), class = "factor"), reports = c(0L, 0L, 0L, 0L, 0L, 0L
), age = c(37.66667, 33.25, 33.66667, 30.5, 32.16667, 23.25),
income = c(4.52, 2.42, 4.5, 2.54, 9.7867, 2.5), share = c(0.03326991,
0.005216942, 0.004155556, 0.06521378, 0.06705059, 0.0444384
), expenditure = c(124.9833, 9.854167, 15, 137.8692, 546.5033,
91.99667), owner = structure(c(2L, 1L, 2L, 1L, 2L, 1L), levels = c("no",
"yes"), class = "factor"), selfemp = structure(c(1L, 1L,
1L, 1L, 1L, 1L), levels = c("no", "yes"), class = "factor"),
dependents = c(3L, 3L, 4L, 0L, 2L, 0L), days = c(54L, 34L,
58L, 25L, 64L, 54L), majorcards = c(1L, 1L, 1L, 1L, 1L, 1L
), active = c(12L, 13L, 5L, 7L, 5L, 1L), income_fam = c(1.13,
0.605, 0.9, 2.54, 3.26223333333333, 2.5)), row.names = c("1",
"2", "3", "4", "5", "6"), class = "data.frame")
I defined this criterion in this way
inc_l<-c("low","average","above average","high")
grad_fact<-function(x){
ifelse(x>=10, 'high',
ifelse(x>6 && x<10, 'above average',
ifelse(x>=3 && x<=6,'average',
ifelse(x<3, 'low'))))
}
And added a column like this
creditcard<-transform(creditcard, incom_levev=factor(sapply(creditcard$income, grad_fact), inc_l, ordered = TRUE))
But I need not to use saaply for this and I tried to do it in this way
creditcard<-transform(creditcard, incom_level=factor(grad_fact(creditcard$income),inc_l, ordered = TRUE))
But in this case, all the elements of the column take the value "average" and I don't understand why, please help me figure out the problem

We may need to change the && to & as && will return a single TRUE/FALSE. According to ?"&&"
& and && indicate logical AND and | and || indicate logical OR. The shorter forms performs elementwise comparisons in much the same way as arithmetic operators. The longer forms evaluates left to right, proceeding only until the result is determined. The longer form is appropriate for programming control-flow and typically preferred in if clauses.
In addition, the last ifelse didn't had a no case
grad_fact<-function(x){
ifelse(x>=10, 'high',
ifelse(x>6 & x<10, 'above average',
ifelse(x>=3 & x<=6,'average',
ifelse(x<3, 'low', NA_character_))))
}
and then use
creditcard <- transform(creditcard, incom_level=
factor(grad_fact(income),inc_l, ordered = TRUE))
-output
creditcard
card reports age income share expenditure owner selfemp dependents days majorcards active income_fam incom_level
1 yes 0 37.66667 4.5200 0.033269910 124.983300 yes no 3 54 1 12 1.130000 average
2 yes 0 33.25000 2.4200 0.005216942 9.854167 no no 3 34 1 13 0.605000 low
3 yes 0 33.66667 4.5000 0.004155556 15.000000 yes no 4 58 1 5 0.900000 average
4 yes 0 30.50000 2.5400 0.065213780 137.869200 no no 0 25 1 7 2.540000 low
5 yes 0 32.16667 9.7867 0.067050590 546.503300 yes no 2 64 1 5 3.262233 above average
6 yes 0 23.25000 2.5000 0.044438400 91.996670 no no 0 54 1 1 2.500000 low

Related

How to add column reporting sum of couple of subsequent rows

I have the following dataset
structure(list(Var1 = structure(c(1L, 2L, 1L, 2L, 1L, 2L, 1L,
2L), .Label = c("0", "1"), class = "factor"), Var2 = structure(c(1L,
1L, 2L, 2L, 1L, 1L, 2L, 2L), .Label = c("congruent", "incongruent"
), class = "factor"), Var3 = structure(c(1L, 1L, 1L, 1L, 2L,
2L, 2L, 2L), .Label = c("spoken", "written"), class = "factor"),
Freq = c(8L, 2L, 10L, 2L, 10L, 2L, 10L, 2L)), class = "data.frame", row.names = c(NA,
-8L))
I would like to add another column reporting sum of coupled subsequent rows. Thus the final result would look like this:
I have proceeded like this
Table = as.data.frame(table(data_1$unimodal,data_1$cong_cond, data_1$presentation_mode)) %>%
mutate(Var1 = factor(Var1, levels = c('0', '1')))
row = Table %>% #is.factor(Table$Var1)
summarise(across(where(is.numeric),
~ .[Var1 == '0'] + .[Var1 == '1'],
.names = "{.col}_sum"))
column = c(rbind(row$Freq_sum,rep(NA, 4)))
Table$column = column
But I am looking for the quickest way possible with no scripting separated codes. Here I have used the dplyr package, but if you might know possibly suggest some other ways with map(), for loop, and or the method you deem as the best, please just let me know.

This should do:
df$column <-
rep(colSums(matrix(df$Freq, 2)), each=2) * c(1, NA)

If you are fine with no NAs in the dataframe, you can
df %>%
group_by(Var2, Var3) %>%
mutate(column = sum(Freq))
# A tibble: 8 × 5
# Groups: Var2, Var3 [4]
Var1 Var2 Var3 Freq column
<fct> <fct> <fct> <int> <int>
1 0 congruent spoken 8 10
2 1 congruent spoken 2 10
3 0 incongruent spoken 10 12
4 1 incongruent spoken 2 12
5 0 congruent written 10 12
6 1 congruent written 2 12
7 0 incongruent written 10 12
8 1 incongruent written 2 12

How to do multiple arithmetic operations according to conditions between two datasets in R

I have several datasets.
The first one
lid=structure(list(x1 = 619490L, x2 = 10L, x3 = 0L, x4 = 6089230L,
x5 = 0L, x6 = -10L), class = "data.frame", row.names = c(NA,
-1L))
second dataset
lidar=structure(list(A = c(638238.76, 638238.76, 638239.29, 638235.39,
638233.86, 638233.86, 638235.55, 638231.97, 638231.91, 638228.41
), B = c(6078001.09, 6078001.09, 6078001.15, 6078001.15, 6078001.07,
6078001.07, 6078001.02, 6078001.08, 6078001.09, 6078001.01),
C = c(186.64, 186.59, 199.28, 189.37, 186.67, 186.67, 198.04,
200.03, 199.73, 192.14), gpstime = c(319805734.664265, 319805734.664265,
319805734.67875, 319805734.678768, 319805734.678777, 319805734.678777,
319805734.687338, 319805734.701928, 319805734.701928, 319805734.701945
), Intensity = c(13L, 99L, 5L, 2L, 20L, 189L, 2L, 11L, 90L,
1L), ReturnNumber = c(2L, 1L, 1L, 2L, 1L, 1L, 2L, 1L, 1L,
3L), NumberOfReturns = c(2L, 1L, 3L, 2L, 1L, 1L, 3L, 1L,
1L, 4L), ScanDirectionFlag = c(1L, 1L, 0L, 0L, 0L, 0L, 1L,
0L, 0L, 0L), EdgeOfFlightline = c(0L, 0L, 0L, 0L, 0L, 0L,
0L, 0L, 0L, 0L), Classification = c(1L, 2L, 1L, 1L, 1L, 1L,
1L, 1L, 1L, 1L)), class = "data.frame", row.names = c(NA,
-10L))
How to subtract the value for each row of the lidar dataset from lid dataset using the formula
(lidar$A-lid$x1)/lid$x3
then
(lidar$B-lid$x4)/lid$x6
So for first row will be result
(lidar$A-lid$x1)/lid$x3=1874,876(but everything after the comma is discarded)=1874(without,876)
(lidar$B-lid$x4)/lid$x6=1122
also in lidar dataset for column lidar$C
subtract the smallest value from the largest value. In this case lidar$c11-lidar$c1=5,5
so desired output for this will be
A B C Intensity ReturnNumber NumberOfReturns row col subs(lidar$Cmax-lidar$Cmin)
638238.76 6078001.09 186.64 13 2 2 1874 1122 5,5
638238.76 6078001.09 186.59 99 1 1 1874 1122 5,5
638239.29 6078001.15 199.28 5 1 3 1874 1122 5,5
638235.39 6078001.15 189.37 2 2 2 1874 1122 5,5
the result of subtraction (lidar$Cmax-lidar$Cmin) for all rows is always the same.
row and col this the result of this arithmetic
(lidar$A-lid$x1)/lid$x3 (row)
then
(lidar$B-lid$x4)/lid$x6 (col)
with the value after the comma, these values(row and col) are different, but we must remove the part after the comma, so they seem to be the same.
How can i get desired output according to such arithmetic operations.
Any of your help is valuable.Thank you

If I understand your purpose correctly, the main question is how to remove the part after comma, which is a decimal separator in your examples.
If that's true, one way of doing that is to split the number into two parts, one which comes before the comma and another one which comes after it, and then extract only the first part. In R you can do this by strsplit(). However, this function requires the input to be characters, not numerics. So, you need to coerce the numbers into characters, do the splitting, coerce the result back to numbers, and then extract its first element.
Here is an example of a function to implement the steps:
remove_after_comma <- function(num_with_comma){
myfun <- function(num_with_comma) {
num_with_comma|>
as.character() |>
strsplit("[,|.]") |>
unlist() |>
as.numeric() |>
getElement(1)
}
vapply(num_with_comma, myfun, FUN.VALUE = numeric(1))
}
Notes:
[,|.] is used to anticipate other systems that use . instead of , as the decimal separator.
vapply is used to make it possible to apply this function to a numeric vectors, such as a numeric column.
Check:
remove_after_comma(c(a = '1,5', b = '12,74'))
# a b
# 1 12
(4:10)/3
#[1] 1.333333 1.666667 2.000000 2.333333 2.666667 3.000000 3.333333
remove_after_comma ((4:10)/3)
#[1] 1 1 2 2 2 3 3
Assuming that lid$x3 = 10L in your example:
(lidar$A-lid$x1)/lid$x3
#[1] 1874.876 1874.876 1874.929 1874.539 1874.386 1874.386 1874.555 1874.197 #1874.191 1873.841
remove_after_comma((lidar$A-lid$x1)/lid$x3)
#[1] 1874 1874 1874 1874 1874 1874 1874 1874 1874 1873

I'm not sure if this is what you mean
`
lidar$row <- round((lidar$A-lid$x1)/lid$x3, 0)
lidar$col <- (lidar$B-lid$x4)/lid$x6
lidar$cdif <- max(lidar$C)-min(lidar$C)
`

Forecasting for multiple products

I want to forecast values for multiple products (in this case product_id 1 and 2 but I actually have a few thousand products) at the same time.
product_id Date Revenue Value
1 1 1/10/12 in 0
2 1 1/13/12 in 1
3 1 2/14/16 in 0
4 1 3/5/16 out 0
5 1 1/5/17 out 0
6 1 3/15/17 out 0
7 2 11/1/11 in 1
8 2 3/14/15 in 2
9 2 1/15/16 in 3
10 2 3/15/17 out 0
11 2 4/11/17 out 0
12 2 5/16/17 out 0
If this were only one product, I would fill in the missing dates with:
allDates <- seq.Date(
min(Dat$Date),
max(Dat$Date),
"day")
allValues <- merge(
x=data.frame(Date=allDates),
y=Value,
all.x=TRUE)
Make the data time series:
time <- ts(dat$Value, start= c(2011,11), frequency=52)
Forecast using hybrid model:
hm1 <- hybridModel(y = time, weights = "insample.errors")
plot(forecast(hm1))
Is there a way that I can do this for both product ids? Or is there a cleaner method without filling in the blank dates?
dat <-structure(list(product_id = c(1L, 1L, 1L, 1L, 1L, 1L, 2L, 2L, 2L, 2L, 2L, 2L),
Date = c("1/10/12", "1/13/12", "2/14/16", "3/5/16", "1/5/17", "3/15/17", "1/1/11", "3/14/15", "1/15/16", "3/15/17", "4/11/17", "5/16/17"),
Revenue = c("in", "in", "in", "out", "out", "out", "in", "in", "in", "out", "out", "out"),
Value = c(0L, 1L, 0L, 0L, 0L, 0L, 1L, 2L, 3L, 0L, 0L, 0L)
),
.Names = c("product_id", "Date", "Revenue", "Value"),
class = "data.frame", row.names = c(NA, -12L))

Replacing loop in dplyr R

So I am trying to program function with dplyr withou loop and here is something I do not know how to do
Say we have tv stations (x,y,z) and months (2,3). If I group by this say we get
this output also with summarised numeric value
TV months value
x 2 52
y 2 87
z 2 65
x 3 180
y 3 36
z 3 99
This is for evaluated Brand.
Then I will have many Brands I need to filter to get only those which get value >=0.8*value of evaluated brand & <=1.2*value of evaluated brand
So for example from this down I would only want to filter first two, and this should be done for all months&TV combinations
brand TV MONTH value
sdg x 2 60
sdfg x 2 55
shs x 2 120
sdg x 2 11
sdga x 2 5000

As #akrun said, you need to use a combination of merging and subsetting. Here's a base R solution.
m <- merge(df, data, by.x=c("TV", "MONTH"), by.y=c("TV", "months"))
m[m$value.x >= m$value.y*0.8 & m$value.x <= m$value.y*1.2,][,-5]
# TV MONTH brand value.x
#1 x 2 sdg 60
#2 x 2 sdfg 55
Data
data <- structure(list(TV = structure(c(1L, 2L, 3L, 1L, 2L, 3L), .Label = c("x",
"y", "z"), class = "factor"), months = c(2L, 2L, 2L, 3L, 3L,
3L), value = c(52L, 87L, 65L, 180L, 36L, 99L)), .Names = c("TV",
"months", "value"), class = "data.frame", row.names = c(NA, -6L
))
df <- structure(list(brand = structure(c(2L, 1L, 4L, 2L, 3L), .Label = c("sdfg",
"sdg", "sdga", "shs"), class = "factor"), TV = structure(c(1L,
1L, 1L, 1L, 1L), .Label = "x", class = "factor"), MONTH = c(2L,
2L, 2L, 2L, 2L), value = c(60L, 55L, 120L, 11L, 5000L)), .Names = c("brand",
"TV", "MONTH", "value"), class = "data.frame", row.names = c(NA,
-5L))

Checking row format of csv

I am trying to import some data (below) and checking to see if I have the appropriate number of rows for later analysis.
repexample <- structure(list(QueueName = structure(c(1L, 1L, 1L, 1L, 1L, 1L,
1L, 1L, 1L, 3L, 3L, 3L, 3L, 2L, 2L, 2L, 2L, 2L, 2L, 2L, 2L, 2L
), .Label = c(" Overall", "CCM4.usci_retention_eng", "usci_helpdesk"
), class = "factor"), X8Tile = structure(c(1L, 2L, 3L, 4L, 5L,
6L, 7L, 8L, 9L, 1L, 2L, 3L, 4L, 1L, 2L, 3L, 4L, 5L, 6L, 7L, 8L,
9L), .Label = c(" Average", "1", "2", "3", "4", "5", "6", "7",
"8"), class = "factor"), Actual = c(508.1821504, 334.6994838,
404.9048759, 469.4068667, 489.2800416, 516.5744106, 551.7966176,
601.5103783, 720.9810622, 262.4622533, 250.2777778, 264.8281938,
272.2807882, 535.2466968, 278.25, 409.9285714, 511.6635101, 553,
641, 676.1111111, 778.5517241, 886.3666667), Calls = c(54948L,
6896L, 8831L, 7825L, 5768L, 7943L, 5796L, 8698L, 3191L, 1220L,
360L, 454L, 406L, 248L, 11L, 9L, 94L, 1L, 65L, 9L, 29L, 30L),
Pop = c(41L, 6L, 5L, 5L, 5L, 5L, 5L, 5L, 5L, 3L, 1L, 1L,
1L, 11L, 2L, 2L, 2L, 1L, 1L, 1L, 1L, 1L)), .Names = c("QueueName",
"X8Tile", "Actual", "Calls", "Pop"), class = "data.frame", row.names = c(NA,
-22L))
The data gives 5 columns and is one example of some data that I would typically import (via a .csv file). As you can see there are three unique values in the column "QueueName". For each unique value in "QueueName" I want to check that it has 9 rows, or the corresponding values in the column "X8Tile" ( Average, 1, 2, 3, 4, 5, 6, 7, 8). As an example the "QueueName" Overall has all of the necessary rows, but usci_helpdesk does not.
So my first priority is to at least identify if one of the unique values in "QueueName" does not have all of the necessary rows.
My second priority would be to remove all of the rows corresponding to a unique "QueueName" that does not meet the requirements.

Both these priorities are easily addressed using the Split-Apply-Combine paradigm, implemented in the plyr package.
Priority 1: Identify values of QueueName which don't have enough rows
require(plyr)
# Make a short table of the number of rows for each unique value of QueueName
rowSummary <- ddply(repexample, .(QueueName), summarise, numRows=length(QueueName))
print(rowSummary)
If you have lots of unique values of QueueName, you'll want to identify the values which are not equal to 9:
rowSummary[rowSummary$numRows !=9, ]
Priority 2: Eliminate rows for which QueueNamedoes not have enough rows
repexample2 <- ddply(repexample, .(QueueName), transform, numRows=length(QueueName))
repexampleEdit <- repexample2[repexample2$numRows ==9, ]
print(repxampleEdit)
(I don't quite understand the meaning of 'check that it has 9 rows, or the corresponding values in the column "X8Tile"). You could edit the repexampleEdit line based on your needs.

This is an approach that makes some assumptions about how your data are ordered. It can be modified (or your data can be reordered) if the assumption doesn't fit:
## Paste together the values from your "X8tile" column
## If all is in order, you should have "Average12345678"
## If anything is missing, you won't....
myMatch <- names(
which(with(repexample, tapply(X8Tile, QueueName, FUN=function(x)
gsub("^\\s+|\\s+$", "", paste(x, collapse = ""))))
== "Average12345678"))
## Use that to subset...
repexample[repexample$QueueName %in% myMatch, ]
# QueueName X8Tile Actual Calls Pop
# 1 Overall Average 508.1822 54948 41
# 2 Overall 1 334.6995 6896 6
# 3 Overall 2 404.9049 8831 5
# 4 Overall 3 469.4069 7825 5
# 5 Overall 4 489.2800 5768 5
# 6 Overall 5 516.5744 7943 5
# 7 Overall 6 551.7966 5796 5
# 8 Overall 7 601.5104 8698 5
# 9 Overall 8 720.9811 3191 5
# 14 CCM4.usci_retention_eng Average 535.2467 248 11
# 15 CCM4.usci_retention_eng 1 278.2500 11 2
# 16 CCM4.usci_retention_eng 2 409.9286 9 2
# 17 CCM4.usci_retention_eng 3 511.6635 94 2
# 18 CCM4.usci_retention_eng 4 553.0000 1 1
# 19 CCM4.usci_retention_eng 5 641.0000 65 1
# 20 CCM4.usci_retention_eng 6 676.1111 9 1
# 21 CCM4.usci_retention_eng 7 778.5517 29 1
# 22 CCM4.usci_retention_eng 8 886.3667 30 1
Similar approaches can be taken with aggregate+merge and similar tools.

Develop Reference

r css asp.net wordpress firebase qt symfony nginx http apache-flex

How can create my own factor column in a dataframe? - r

Related

How to add column reporting sum of couple of subsequent rows

How to do multiple arithmetic operations according to conditions between two datasets in R

Forecasting for multiple products

Replacing loop in dplyr R

Checking row format of csv

Categories

Resources