Converting a date into month, day, week, etc. in a Shiny application - r

I am very new to Shiny. I have already built a customer prediction model (linear regression) in R. I have date-wise historical customer data; for the analysis I extracted month, day, week, etc. from the date and used them as independent variables, with the number of customers as the dependent variable. Now I would like to build a Shiny application for the same model. The application will take a date as input and output the predicted number of customers. My problem is converting the new input date into month, day, week, etc. so that I can predict the number of customers for that date. Kindly help me in this regard.
Prediction
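library(caret)  # provides RMSE()

# read the historical data and parse the date column
mydata <- read.csv("main.csv", header = TRUE)
mydata$date <- as.Date(mydata$date, format = "%m/%d/%Y")

# derive calendar features from the date
mydata$month <- strftime(mydata$date, "%m")      # month, 01-12
mydata$day <- strftime(mydata$date, "%d")        # day of month, 01-31
mydata$week <- strftime(mydata$date, "%w")       # day of week, 0 (Sunday) to 6
mydata$week_year <- strftime(mydata$date, "%W")  # week of year, 00-53

# treat the features as categorical
mydata$month <- as.factor(mydata$month)
mydata$day <- as.factor(mydata$day)
mydata$week <- as.factor(mydata$week)
mydata$week_year <- as.factor(mydata$week_year)

# keep columns 1, 3, 4, 5 and 6
mydata <- mydata[c(1, 3, 4, 5, 6)]

# random split, roughly 70% train / 30% test
ind <- sample(2, nrow(mydata), replace = TRUE, prob = c(0.7, 0.3))
trainset <- mydata[ind == 1, ]
testset <- mydata[ind == 2, ]

# linear model (Gaussian GLM) on the calendar features
pred_cus <- glm(no_customer ~ month + week + day,
                data = trainset,
                family = gaussian)

# evaluate on the held-out rows
testset$prediction <- predict(pred_cus, testset)
RMSE(testset$prediction, testset$no_customer)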
structure(list(
Customer = c(94L, 61L, 51L, 28L, 29L, 56L, 99L, 87L, 88L, 71L, 40L, 33L,
57L, 71L, 84L, 81L, 57L, 31L, 28L, 77L, 84L, 69L, 76L, 65L,
36L, 26L, 60L, 70L, 82L, 81L, 49L, 54L, 18L, 66L, 89L, 69L,
61L, 88L, 40L, 25L, 82L, 77L, 88L, 72L, 75L, 40L, 24L, 79L,
67L, 82L, 55L, 78L, 44L, 14L, 76L, 89L, 87L, 93L, 64L, 23L,
34L, 65L, 83L, 92L, 87L, 105L, 40L, 32L, 80L, 76L, 83L, 76L,
70L, 43L, 33L, 75L, 75L, 70L, 55L, 70L, 36L, 13L, 64L, 72L,
79L, 62L, 52L, 30L, 32L, 85L, 87L, 84L, 93L, 73L, 21L, 19L,
101L, ''''''''''''''''''''''''''''''''
Label = c("1/1/2016", "1/10/2016", "1/11/2016", "1/12/2016", "1/13/2016",
"1/14/2016", "1/15/2016", "1/16/2016", "1/17/2016", "1/18/2016",
"1/19/2016", "1/2/2016", "1/20/2016", "1/21/2016", "1/22/2016",
"1/23/2016", "1/24/2016", "1/25/2016", "1/26/2016", "1/27/2016",
"1/28/2016", "1/29/2016", "1/3/2016", "1/30/2016", "1/31/2016",
"1/4/2016", "1/5/2016", "1/6/2016", "1/7/2016", "1/8/2016",
"1/9/2016", "10/1/2015", "10/10/2015", "10/11/2015", "10/12/2015",
"10/13/2015", "10/14/2015", "10/15/2015", "10/16/2015", "10/17/2015",
"10/18/2015", "10/19/2015", "10/2/2015", "10/20/2015", "10/21/2015",
"10/22/2015", "10/23/2015", "10/24/2015",
class = "factor")
),
.Names = c("Customer", "date"),
class = "data.frame",
row.names = c(NA, -457L)
)
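One possible way to handle the new input date, sketched with made-up data standing in for main.csv: build the same calendar features from the date and reuse the factor levels seen during training, so that predict() gets columns of the same type. The helper name make_date_features is hypothetical and not part of any package. In a Shiny server function the date would presumably come from a dateInput() widget, which returns a Date; here a literal date is used so the sketch runs without Shiny.

```r
# Hypothetical helper: derive the same features used for training and
# align the factor levels with those of the training data.
make_date_features <- function(dates, train) {
  dates <- as.Date(dates)
  data.frame(
    month = factor(format(dates, "%m"), levels = levels(train$month)),
    day   = factor(format(dates, "%d"), levels = levels(train$day)),
    week  = factor(format(dates, "%w"), levels = levels(train$week))
  )
}

# Toy training data (assumption: values are simulated, not the real counts).
set.seed(1)
train <- data.frame(
  date = seq(as.Date("2015-01-01"), as.Date("2016-12-31"), by = "day")
)
train$no_customer <- rpois(nrow(train), lambda = 60)
train$month <- factor(format(train$date, "%m"))
train$day   <- factor(format(train$date, "%d"))
train$week  <- factor(format(train$date, "%w"))

fit <- glm(no_customer ~ month + week + day, data = train, family = gaussian)

# A new date, as it might arrive from a date input in the app.
new_x <- make_date_features("2017-03-15", train)
pred  <- predict(fit, newdata = new_x)
```

Inside the app, the last two lines would sit in a reactive or render function, with the literal date replaced by the input value.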

Related

Using case_when to fill out a string

I am trying to use case_when in order to pad out a string in R, dependent on the string length.
I take the following 3 examples with lengths 11, 12 and 13:
V1 V2
74300000330 00074300000330
811693200042 08011693200042
8829999820128 88029999820128
V1 is the column I am trying to match with V2
The first row in V1 has 11 digits. If a row has 11 digits, add 3 zeros at the beginning of the number.
I have tried the following code without any luck (I have also tried it with paste0());
df %>%
mutate(col3 = case_when(length(col1) == 11 ~ str_pad(14, width = 3, pad = "0")))
The second row has 12 digits, so I should add one zero at the beginning of the number and then another zero between the first digit (counting from the left) and the 11th digit (counting from the right), so row 2 would go from 81169... to 0801169....
The third row has 13 digits, so I should insert a zero between the 2nd digit (counting from the left) and the 11th digit (counting from the right). So the beginning of the sequence goes from 88299 to 880299.
The total number of digits in the sequence should be exactly 14.
Data:
df <- structure(list(col1 = structure(c(1L, 1L, 1L, 2L, 2L, 3L, 3L,
3L, 3L, 3L, 3L, 3L, 3L, 3L, 3L, 3L, 3L, 3L, 3L, 3L, 3L, 3L, 4L,
4L, 5L, 5L, 5L, 5L, 6L, 6L, 6L, 6L, 7L, 7L, 8L, 8L, 8L, 8L, 8L,
8L, 8L, 8L, 8L, 8L, 9L, 9L, 10L, 10L, 10L, 10L, 10L, 10L, 10L,
10L, 11L, 12L, 12L, 13L, 13L, 14L, 15L, 16L, 17L, 18L, 19L, 20L,
20L, 20L, 20L, 20L, 20L, 20L, 20L, 20L, 20L, 20L, 20L, 20L, 21L,
21L, 21L, 22L, 22L, 22L, 22L, 22L, 23L, 23L, 23L, 23L, 23L, 23L,
23L, 23L, 23L, 24L, 24L, 24L, 24L, 25L, 26L, 27L, 27L, 27L, 27L,
27L, 27L, 27L, 27L, 27L, 27L, 28L, 28L, 28L, 29L, 30L, 30L, 30L,
31L, 32L, 33L, 33L, 33L, 33L, 33L, 34L, 34L, 34L, 34L, 35L, 36L,
36L, 36L, 36L, 36L, 36L, 36L, 36L, 36L, 37L, 38L, 38L, 38L, 38L,
38L, 39L, 39L, 39L, 39L, 40L, 41L, 41L, 41L, 42L, 42L, 43L, 44L,
45L, 45L, 45L, 45L, 45L, 46L, 46L, 47L, 47L, 47L, 47L, 47L, 47L,
47L, 47L, 47L, 47L, 47L, 47L, 47L, 47L, 47L, 47L, 47L, 47L, 47L,
47L, 47L, 47L, 47L, 47L, 47L, 47L, 47L, 47L, 47L, 47L, 47L, 47L,
47L, 48L, 49L, 49L, 49L, 50L, 50L, 50L, 50L, 50L, 50L, 50L, 50L,
50L, 50L, 50L, 50L, 50L, 50L, 51L, 51L, 51L, 51L, 51L, 51L, 51L,
51L, 51L, 51L, 51L, 52L, 52L, 53L, 53L, 53L, 53L, 54L, 55L, 56L,
56L, 56L, 56L, 56L, 56L, 56L, 56L, 57L, 58L, 59L, 59L, 60L, 60L,
60L, 60L, 60L, 60L, 60L, 60L, 60L, 60L, 60L, 61L, 61L, 61L, 61L,
61L, 62L, 62L, 63L, 64L, 65L, 66L, 66L, 66L, 66L, 66L, 66L, 66L,
66L, 66L, 66L, 66L, 66L, 66L, 66L, 66L, 66L, 66L, 67L, 67L, 68L,
68L, 69L, 69L, 69L, 70L, 70L, 70L, 70L, 70L, 70L, 71L, 71L, 71L,
71L, 71L, 71L, 71L, 71L, 71L, 71L, 71L, 71L, 71L, 71L, 71L, 71L,
71L, 71L, 71L, 71L, 71L, 71L, 71L, 71L, 71L, 71L, 71L, 72L, 72L,
72L, 72L, 72L, 72L, 72L, 72L, 72L, 72L, 72L, 72L, 72L, 72L, 72L,
73L, 73L, 73L, 73L, 73L, 73L, 73L, 73L, 73L, 73L, 73L, 73L, 74L,
74L, 74L, 74L, 74L, 75L, 75L, 75L, 76L, 77L, 77L, 78L, 79L, 80L,
81L, 82L, 83L, 83L, 83L, 83L, 83L, 83L, 83L, 83L, 83L, 84L, 84L,
84L, 85L, 86L, 86L, 87L, 87L, 87L, 87L, 88L, 89L, 90L, 91L, 92L,
93L, 93L, 93L, 94L, 94L, 95L, 95L, 95L, 95L, 95L, 96L, 97L, 97L,
97L, 98L, 99L, 100L, 100L, 100L, 100L, 101L, 102L, 102L, 103L,
104L, 105L, 105L, 105L, 105L, 105L, 105L, 105L, 105L, 105L, 106L,
107L, 107L, 108L, 109L, 109L, 109L, 109L, 109L, 109L, 109L, 110L,
110L, 110L, 110L, 110L, 110L, 110L, 110L, 110L, 111L, 111L, 111L,
111L, 112L, 112L, 112L, 112L, 112L, 112L, 112L, 113L, 113L, 113L,
113L, 113L, 113L, 114L, 114L, 114L, 114L, 114L, 114L, 114L, 114L,
115L, 116L, 116L, 117L, 117L, 117L, 118L, 118L, 118L, 118L, 118L,
118L, 118L, 118L, 118L, 118L, 119L, 119L, 119L, 119L, 119L, 119L,
119L, 119L, 119L, 120L, 120L, 120L, 121L, 122L, 122L, 122L, 122L,
122L, 122L, 122L, 123L, 123L, 123L, 123L, 123L, 123L, 123L), .Label = c("11114110010",
"11114110022", "11114110029", "11114110036", "11114110210", "11114110230",
"11114110261", "11114110271", "11114110281", "11114110291", "11114110316",
"11114110526", "11780900029", "11780900050", "11780900660", "11780900661",
"12451500878", "12451567602", "12550000033", "12550000365", "12550000366",
"12550000367", "12550000371", "12550000376", "12550000377", "12550000384",
"12550000388", "12550000392", "12550000393", "12550000397", "12550000401",
"12550000402", "12550000538", "12550006763", "12550006764", "12550020040",
"12550020042", "12550020043", "12550020044", "12550020188", "12550020204",
"12550020212", "12550090015", "12800046631", "12800063141", "12800070612",
"14300002922", "14300002923", "14300002924", "14300002925", "14300002934",
"14300002940", "14300002941", "14300002942", "14300003300", "14300004091",
"14300004296", "14300004299", "14300004301", "14300004648", "14300004650",
"14300004651", "14300070522", "15543760143", "15543760145", "15543760186",
"15543760235", "15543760253", "17089302817", "17103800044", "17103800047",
"17103800048", "17103800053", "17103800056", "17103800058", "17103800059",
"17103801173", "17103801175", "17232305018", "17447100091", "17510100575",
"17510100576", "17510121064", "17510121065", "17510181458", "17732447059",
"17762300048", "17762300060", "18903644280", "19955508003", "19955508050",
"19955508060", "19955508061", "19955508531", "19955508534", "19955508758",
"19955508792", "19955508800", "19955508801", "19955508832", "19955508992",
"19955509803", "19955538570", "19955538696", "19955538725", "19955538792",
"21291912261", "21780900078", "22550081121", "22550081122", "22800025406",
"22800030050", "24300070590", "25543760142", "25543760521", "29955539550",
"31291912240", "39955508520", "41114110525", "57103800060", "74300000330",
"8,11693E+11", "8,83E+12"), class = "factor"), col2 = structure(c(1L,
1L, 1L, 2L, 2L, 3L, 3L, 3L, 3L, 3L, 3L, 3L, 3L, 3L, 3L, 3L, 3L,
3L, 3L, 3L, 3L, 3L, 4L, 4L, 5L, 5L, 5L, 5L, 6L, 6L, 6L, 6L, 7L,
7L, 8L, 8L, 8L, 8L, 8L, 8L, 8L, 8L, 8L, 8L, 9L, 9L, 10L, 10L,
10L, 10L, 10L, 10L, 10L, 10L, 11L, 12L, 12L, 13L, 13L, 14L, 15L,
16L, 17L, 18L, 19L, 20L, 20L, 20L, 20L, 20L, 20L, 20L, 20L, 20L,
20L, 20L, 20L, 20L, 21L, 21L, 21L, 22L, 22L, 22L, 22L, 22L, 23L,
23L, 23L, 23L, 23L, 23L, 23L, 23L, 23L, 24L, 24L, 24L, 24L, 25L,
26L, 27L, 27L, 27L, 27L, 27L, 27L, 27L, 27L, 27L, 27L, 28L, 28L,
28L, 29L, 30L, 30L, 30L, 31L, 32L, 33L, 33L, 33L, 33L, 33L, 34L,
34L, 34L, 34L, 35L, 36L, 36L, 36L, 36L, 36L, 36L, 36L, 36L, 36L,
37L, 38L, 38L, 38L, 38L, 38L, 39L, 39L, 39L, 39L, 40L, 41L, 41L,
41L, 42L, 42L, 43L, 44L, 45L, 45L, 45L, 45L, 45L, 46L, 46L, 47L,
47L, 47L, 47L, 47L, 47L, 47L, 47L, 47L, 47L, 47L, 47L, 47L, 47L,
47L, 47L, 47L, 47L, 47L, 47L, 47L, 47L, 47L, 47L, 47L, 47L, 47L,
47L, 47L, 47L, 47L, 47L, 47L, 48L, 49L, 49L, 49L, 50L, 50L, 50L,
50L, 50L, 50L, 50L, 50L, 50L, 50L, 50L, 50L, 50L, 50L, 51L, 51L,
51L, 51L, 51L, 51L, 51L, 51L, 51L, 51L, 51L, 52L, 52L, 53L, 53L,
53L, 53L, 54L, 55L, 56L, 56L, 56L, 56L, 56L, 56L, 56L, 56L, 57L,
58L, 59L, 59L, 60L, 60L, 60L, 60L, 60L, 60L, 60L, 60L, 60L, 60L,
60L, 61L, 61L, 61L, 61L, 61L, 62L, 62L, 63L, 64L, 65L, 66L, 66L,
66L, 66L, 66L, 66L, 66L, 66L, 66L, 66L, 66L, 66L, 66L, 66L, 66L,
66L, 66L, 67L, 67L, 68L, 68L, 69L, 69L, 69L, 70L, 70L, 70L, 70L,
70L, 70L, 71L, 71L, 71L, 71L, 71L, 71L, 71L, 71L, 71L, 71L, 71L,
71L, 71L, 71L, 71L, 71L, 71L, 71L, 71L, 71L, 71L, 71L, 71L, 71L,
71L, 71L, 71L, 72L, 72L, 72L, 72L, 72L, 72L, 72L, 72L, 72L, 72L,
72L, 72L, 72L, 72L, 72L, 73L, 73L, 73L, 73L, 73L, 73L, 73L, 73L,
73L, 73L, 73L, 73L, 74L, 74L, 74L, 74L, 74L, 75L, 75L, 75L, 76L,
77L, 77L, 78L, 79L, 80L, 81L, 82L, 83L, 83L, 83L, 83L, 83L, 83L,
83L, 83L, 83L, 84L, 84L, 84L, 85L, 86L, 86L, 87L, 87L, 87L, 87L,
88L, 89L, 90L, 91L, 92L, 93L, 93L, 93L, 94L, 94L, 95L, 95L, 95L,
95L, 95L, 96L, 97L, 97L, 97L, 98L, 99L, 100L, 100L, 100L, 100L,
101L, 102L, 102L, 103L, 104L, 105L, 105L, 105L, 105L, 105L, 105L,
105L, 105L, 105L, 106L, 107L, 107L, 108L, 109L, 109L, 109L, 109L,
109L, 109L, 109L, 110L, 110L, 110L, 110L, 110L, 110L, 110L, 110L,
110L, 111L, 111L, 111L, 111L, 112L, 112L, 112L, 112L, 112L, 112L,
112L, 113L, 113L, 113L, 113L, 113L, 113L, 114L, 114L, 114L, 114L,
114L, 114L, 114L, 114L, 115L, 116L, 116L, 117L, 117L, 117L, 118L,
118L, 118L, 118L, 118L, 118L, 118L, 118L, 118L, 118L, 119L, 119L,
119L, 119L, 119L, 119L, 119L, 119L, 119L, 120L, 120L, 120L, 121L,
123L, 122L, 123L, 123L, 123L, 123L, 123L, 127L, 124L, 126L, 126L,
127L, 127L, 125L), .Label = c("00011114110010", "00011114110022",
"00011114110029", "00011114110036", "00011114110210", "00011114110230",
"00011114110261", "00011114110271", "00011114110281", "00011114110291",
"00011114110316", "00011114110526", "00011780900029", "00011780900050",
"00011780900660", "00011780900661", "00012451500878", "00012451567602",
"00012550000033", "00012550000365", "00012550000366", "00012550000367",
"00012550000371", "00012550000376", "00012550000377", "00012550000384",
"00012550000388", "00012550000392", "00012550000393", "00012550000397",
"00012550000401", "00012550000402", "00012550000538", "00012550006763",
"00012550006764", "00012550020040", "00012550020042", "00012550020043",
"00012550020044", "00012550020188", "00012550020204", "00012550020212",
"00012550090015", "00012800046631", "00012800063141", "00012800070612",
"00014300002922", "00014300002923", "00014300002924", "00014300002925",
"00014300002934", "00014300002940", "00014300002941", "00014300002942",
"00014300003300", "00014300004091", "00014300004296", "00014300004299",
"00014300004301", "00014300004648", "00014300004650", "00014300004651",
"00014300070522", "00015543760143", "00015543760145", "00015543760186",
"00015543760235", "00015543760253", "00017089302817", "00017103800044",
"00017103800047", "00017103800048", "00017103800053", "00017103800056",
"00017103800058", "00017103800059", "00017103801173", "00017103801175",
"00017232305018", "00017447100091", "00017510100575", "00017510100576",
"00017510121064", "00017510121065", "00017510181458", "00017732447059",
"00017762300048", "00017762300060", "00018903644280", "00019955508003",
"00019955508050", "00019955508060", "00019955508061", "00019955508531",
"00019955508534", "00019955508758", "00019955508792", "00019955508800",
"00019955508801", "00019955508832", "00019955508992", "00019955509803",
"00019955538570", "00019955538696", "00019955538725", "00019955538792",
"00021291912261", "00021780900078", "00022550081121", "00022550081122",
"00022800025406", "00022800030050", "00024300070590", "00025543760142",
"00025543760521", "00029955539550", "00031291912240", "00039955508520",
"00041114110525", "00057103800060", "00074300000330", "08011693200041",
"08011693200042", "88029999819907", "88029999820074", "88029999820083",
"88029999820128"), class = "factor")), row.names = c(NA, -513L
), class = "data.frame")
A few issues here. Your columns appear to be factors, which can create confusing problems when you apply string functions to them. You want them to be character, not factor. The correct way to check the length of a string is with nchar (spoiler alert: does not work with factor data!).
Your rules for padding seem a little arbitrary, but the following should work. For padding "within" the digit string, gsub and regular expressions work wonders.
library(dplyr)
library(stringr)

# convert the factor columns to character, then pad according to length
df2 <- mutate_at(df, vars(col1, col2), as.character) %>%
  mutate(col3 = case_when(
    nchar(col1) == 11 ~ str_pad(col1, width = 14, pad = '0'),
    nchar(col1) == 12 ~ gsub('(\\d)(\\d+)', '0\\10\\2', col1),
    nchar(col1) == 13 ~ gsub('(\\d\\d)(\\d+)', '\\10\\2', col1),
    TRUE ~ col1
  ))
col1 col3
<chr> <chr>
1 74300000330 00074300000330
2 811693200042 08011693200042
3 8829999820128 88029999820128
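For comparison, the same three rules can be sketched in base R without regular expressions, using nchar() and substr() on the three example values from the question. The function name pad_to_14 is made up for this illustration; values of any other length are returned unchanged.

```r
# Hypothetical helper implementing the three padding rules from the question.
pad_to_14 <- function(x) {
  x <- as.character(x)  # factors must be converted before using nchar()
  n <- nchar(x)
  out <- x
  # 11 digits: three leading zeros
  out[n == 11] <- paste0("000", x[n == 11])
  # 12 digits: leading zero, first digit, zero, remaining 11 digits
  i <- n == 12
  out[i] <- paste0("0", substr(x[i], 1, 1), "0", substr(x[i], 2, 12))
  # 13 digits: first two digits, zero, remaining 11 digits
  j <- n == 13
  out[j] <- paste0(substr(x[j], 1, 2), "0", substr(x[j], 3, 13))
  out
}

v1 <- factor(c("74300000330", "811693200042", "8829999820128"))
res <- pad_to_14(v1)
```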

Calculating probability using normal distribution and t-distribution in R

I have this sample:
x=c(92L, 9L, 38L, 43L, 74L, 16L, 75L, 55L, 39L, 77L, 76L, 52L,
100L, 85L, 62L, 60L, 49L, 28L, 6L, 27L, 63L, 22L, 23L, 99L, 61L,
25L, 19L, 48L, 91L, 57L, 97L, 84L, 31L, 87L, 1L, 21L, 30L, 41L,
13L, 72L, 68L, 95L, 47L, 11L, 24L, 58L, 18L, 67L, 33L, 8L, 50L,
4L, 40L, 12L, 73L, 78L, 86L, 69L, 44L, 83L, 94L, 65L, 37L, 70L,
54L, 46L, 15L, 53L, 89L, 98L, 90L, 3L, 14L, 17L, 42L, 45L, 79L,
20L, 32L, 34L, 64L, 88L, 81L, 96L, 59L, 71L, 56L, 26L, 51L, 29L,
80L, 7L, 36L, 93L, 82L, 35L, 5L, 2L, 10L, 66L)
I want to calculate the probability P(X > mean(x) + 3), assuming the data are normally distributed.
First I compute mean(x) = 50.5 and sd(x) = 29.01.
Using these as the parameters of the normal distribution, the probability becomes:
P(X > 53.5)
pnorm(53.5, mean=mean(x), sd=sd(x), lower.tail=FALSE)
Using the standard normal distribution instead, with Z = (X - mean(x)) / sd(x):
P(X > 53.5) = P(Z > (53.5 - 50.5) / 29.01) = P(Z > 3 / 29.01)
pnorm(3/29.01149, mean=0, sd=1, lower.tail=FALSE)
But how should I proceed if I want to use the Student's t distribution?
It is more legitimate to use the t distribution here, as the standard deviation is estimated from the data.
pt(3 / sd(x), df = length(x) - 1, lower.tail = FALSE)
# [1] 0.4589245
We have length(x) observations but also estimate one parameter (the standard deviation), so the t distribution has length(x) - 1 degrees of freedom.
There is not much difference compared with using the normal distribution, though, given that length(x) is 100 (which is large enough):
pnorm(3 / sd(x), lower.tail = FALSE)
# [1] 0.4588199
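To illustrate why the two results are so close, here is a small sketch that evaluates the same upper-tail probability under t distributions with increasing degrees of freedom. The particular df values are arbitrary choices for illustration; 99 corresponds to length(x) - 1 for this sample.

```r
z <- 3 / 29.01149                 # standardised cutoff from the question
dfs <- c(5, 30, 99, 1000)         # illustrative degrees of freedom

t_tail <- pt(z, df = dfs, lower.tail = FALSE)  # upper tail under t
n_tail <- pnorm(z, lower.tail = FALSE)         # upper tail under normal

# absolute gap between the t and normal answers for each df
gap <- abs(t_tail - n_tail)
data.frame(df = dfs, t_tail = t_tail, gap = gap)
```

The gap shrinks as the degrees of freedom grow, which is why the choice matters mainly for small samples.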

How to find Number of unique occurrences of a value in data-set?

I have the following piece of my data-set:
> dput(test)
structure(list(X2002.06.26 = structure(c(99L, 88L, 65L, 94L,
60L, 101L, 27L, 83L, 16L, 12L, 54L, 97L, 63L, 41L, 13L, 2L, 58L,
9L, 82L, 22L, 14L, 77L, 55L, 32L, 45L, 80L, 39L, 70L, 114L, 103L,
69L, 104L, 106L, 108L, 38L, 10L, 64L, 1L, 112L, 102L, 67L, 98L,
66L, 19L, 81L, 72L, 89L, 23L, 48L, 4L, 25L, 91L, 26L, 62L, 33L,
3L, 28L, 57L, 17L, 20L, 73L, 78L, 90L, 84L, 5L, 92L, 43L, 74L,
75L, 93L, 100L, 56L, 36L, 79L, 111L, 52L, 24L, 105L, 29L, 53L,
110L, 71L, 18L, 8L, 34L, 50L, 109L, 61L, 35L, 21L, 11L, 47L,
59L, 51L, 113L, 44L, 30L, 42L, 107L, 7L, 87L, 6L, 68L, 96L, 86L,
15L, 46L, 85L, 31L, 49L, 40L, 76L, 95L, 115L, 37L), .Label = c("BMG4388N1065",
"BMG812761002", "GB00BYMT0J19", "IE00BLS09M33", "IE00BQRQXQ92",
"US0003611052", "US0015471081", "US0025671050", "US0028962076",
"US0044981019", "US0116591092", "US01741R1023", "US0185223007",
"US01988P1084", "US0305061097", "US0311001004", "US03662Q1058",
"US0375981091", "US0383361039", "US03836W1036", "US03937C1053",
"US0396701049", "US0462241011", "US06652V2088", "US0997241064",
"US1033041013", "US1096961040", "US1170431092", "US1250711009",
"US1258961002", "US12686C1099", "US1311931042", "US1416651099",
"US1423391002", "US1431301027", "US1564311082", "US1718711062",
"US1778351056", "US2193501051", "US2289031005", "US23331A1097",
"US2537981027", "US2829141009", "US2925621052", "US2966891028",
"US3116421021", "US34354P1057", "US3498531017", "US3693851095",
"US3984331021", "US3989051095", "US4158641070", "US4222451001",
"US4285671016", "US4586653044", "US4835481031", "US5261071071",
"US5367971034", "US5463471053", "US55305B1017", "US5535301064",
"US5562691080", "US5663301068", "US5871181005", "US59001A1025",
"US6081901042", "US62914B1008", "US6517185046", "US6900701078",
"US6907684038", "US6936561009", "US7081601061", "US7132781094",
"US7234561097", "US7310681025", "US7415034039", "US7496851038",
"US7549071030", "US7595091023", "US76009N1000", "US7703231032",
"US7811821005", "US7835491082", "US8081941044", "US8308791024",
"US83088M1027", "US83545G1022", "US8354951027", "US8528572006",
"US8545021011", "US85590A4013", "US8581191009", "US8589121081",
"US8681571084", "US8685361037", "US8712371033", "US8793691069",
"US8799391060", "US8832031012", "US8851601018", "US8865471085",
"US8873891043", "US88830M1027", "US8968181011", "US89785X1019",
"US8990355054", "US90385D1072", "US9134831034", "US9202531011",
"US92552R4065", "US9410531001", "US9427491025", "US9433151019",
"US9633201069", "US9837721045"), class = "factor"), X2002.06.27 = structure(c(57L,
43L, 73L, 70L, 35L, 114L, 58L, 88L, 55L, 7L, 72L, 28L, 16L, 84L,
110L, 44L, 75L, 20L, 99L, 18L, 10L, 80L, 113L, 52L, 66L, 36L,
60L, 101L, 107L, 103L, 34L, 22L, 81L, 40L, 1L, 46L, 108L, 106L,
91L, 37L, 98L, 9L, 104L, 115L, 54L, 100L, 42L, 2L, 3L, 26L, 21L,
71L, 23L, 62L, 50L, 97L, 11L, 94L, 27L, 53L, 79L, 4L, 51L, 76L,
49L, 78L, 87L, 32L, 59L, 96L, 13L, 86L, 15L, 48L, 109L, 29L,
85L, 68L, 17L, 41L, 64L, 31L, 8L, 38L, 90L, 45L, 12L, 56L, 6L,
39L, 92L, 63L, 5L, 82L, 19L, 89L, 69L, 74L, 25L, 95L, 105L, 61L,
67L, 14L, 112L, 111L, 102L, 83L, 93L, 33L, 30L, 47L, 65L, 24L,
77L), .Label = c("CH0044328745", "GB00BVVBC028", "LR0008862868",
"US0003611052", "US0010841023", "US0044981019", "US0079731008",
"US0116591092", "US0305061097", "US0311001004", "US0383361039",
"US03937C1053", "US0462241011", "US06652V2088", "US0733021010",
"US0952291005", "US0997241064", "US1096411004", "US1096961040",
"US1265011056", "US12686C1099", "US1311931042", "US1431301027",
"US1564311082", "US1628251035", "US1630721017", "US1897541041",
"US2017231034", "US23331A1097", "US2829141009", "US2925621052",
"US29444U7000", "US2974251009", "US3024913036", "US3138551086",
"US34354P1057", "US3596941068", "US3693851095", "US3719011096",
"US3825501014", "US3984331021", "US3989051095", "US4108671052",
"US4130861093", "US4158641070", "US4456581077", "US4586653044",
"US4606901001", "US48666K1097", "US5006432000", "US5053361078",
"US5138471033", "US5179421087", "US5246601075", "US5260571048",
"US5463471053", "US5526761086", "US5535301064", "US5663301068",
"US5766901012", "US59001A1025", "US6117421072", "US63935N1072",
"US6515871076", "US67066G1040", "US6795801009", "US6819191064",
"US6900701078", "US6907684038", "US6935061076", "US6936561009",
"US6951561090", "US7004162092", "US73179P1066", "US7376301039",
"US7401891053", "US74762E1029", "US7496851038", "US7549071030",
"US7757111049", "US7811821005", "US8305661055", "US8308791024",
"US8335511049", "US83545G1022", "US8354951027", "US8358981079",
"US8545021011", "US85590A4013", "US86732Y1091", "US8681681057",
"US8712371033", "US87305R1095", "US8799391060", "US8851601018",
"US88830M1027", "US8894781033", "US8962391004", "US8968181011",
"US89785X1019", "US9022521051", "US90385D1072", "US9046772003",
"US9111631035", "US9134831034", "US92552R4065", "US92552V1008",
"US9258151029", "US9292361071", "US9410531001", "US9427491025",
"US9433151019", "US9699041011", "US9746371007", "US9807451037"
), class = "factor")), .Names = c("X2002.06.26", "X2002.06.27"
), class = "data.frame", row.names = c(NA, -115L))
The actual data extends over 3000+ columns and there are approximately 1150 unique values.
I need to count how many times each of these values appears in the Data-Set.
We can flatten the elements of the data frame into a single character vector first, then apply table() (using the data frame test from the question):
tab1 <- table(do.call(c, lapply(test, as.character)))
Another option is to convert the data frame to a matrix and then apply table():
tab2 <- table(as.matrix(test))
identical(tab1, tab2)
[1] TRUE
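A minimal sketch of the same idea on a toy data frame (the column names and values are invented for illustration), including sorting the counts so the most frequent values come first:

```r
# Toy stand-in for the real data: each column is a factor of identifiers.
toy <- data.frame(
  d1 = factor(c("A", "B", "C")),
  d2 = factor(c("B", "C", "D")),
  d3 = factor(c("B", "D", "E"))
)

# Flatten all columns to one character vector and count each value.
counts <- table(unlist(lapply(toy, as.character)))

# Most frequent values first.
top <- sort(counts, decreasing = TRUE)
```

Converting to character before flattening matters because each factor column has its own level set, so the underlying integer codes are not comparable across columns.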

Plotting a Dataframe in R

I have a dataframe of the form
Region Name 3-15 4-15 5-15 ... 3-16
Name1 30 82 56 ... 32
Name2 65 23 38 ... 11
... ... ... ... ... ...
Name18 87 33 11 ... 51
The first column being the names of regions and the other columns being recorded events over time (monthly by column)
I'd like to plot the recorded monthly values over time with respect to their associated name. Specifically, a different line for each Named region with a differentiated colour. Any advice would be appreciated, a lot of the plotting functions for data frames seem to function on frames of a different format.
dput() data:
dataframe <- structure(list("LSOA Name" = c("Lancaster 001", "Lancaster 002",
"Lancaster 003", "Lancaster 004", "Lancaster 005", "Lancaster 006",
"Lancaster 008", "Lancaster 009", "Lancaster 010", "Lancaster 011",
"Lancaster 013", "Lancaster 014", "Lancaster 015", "Lancaster 016",
"Lancaster 017", "Lancaster 018", "Lancaster 019", "Lancaster 020"
), "3-15" = c(49L, 16L, 17L, 28L, 21L, 197L, 57L, 143L, 78L,
121L, 67L, 223L, 41L, 86L, 66L, 27L, 40L, 77L), "4-15" = c(63L,
11L, 26L, 29L, 19L, 203L, 69L, 154L, 82L, 125L, 62L, 198L, 44L,
99L, 64L, 26L, 42L, 99L), "5-15" = c(67L, 10L, 20L, 30L, 10L,
194L, 62L, 186L, 61L, 110L, 75L, 273L, 29L, 126L, 92L, 34L, 41L,
88L), "6-15" = c(58L, 8L, 18L, 36L, 29L, 198L, 62L, 167L, 83L,
110L, 59L, 254L, 26L, 99L, 73L, 17L, 30L, 109L), "7-15" = c(53L,
29L, 27L, 23L, 38L, 188L, 56L, 149L, 90L, 129L, 37L, 226L, 32L,
119L, 57L, 14L, 30L, 96L), "8-15" = c(44L, 9L, 25L, 28L, 29L,
237L, 69L, 171L, 78L, 108L, 45L, 261L, 22L, 103L, 68L, 33L, 35L,
108L), "9-15" = c(59L, 12L, 18L, 35L, 19L, 230L, 45L, 128L, 74L,
144L, 56L, 223L, 26L, 90L, 51L, 27L, 23L, 120L), "10-15" = c(45L,
26L, 31L, 23L, 25L, 195L, 53L, 155L, 74L, 120L, 58L, 276L, 38L,
92L, 72L, 25L, 40L, 123L), "11-15" = c(31L, 11L, 33L, 15L, 19L,
188L, 52L, 127L, 66L, 102L, 50L, 241L, 26L, 74L, 72L, 26L, 35L,
68L), "12-15" = c(34L, 22L, 21L, 22L, 17L, 205L, 80L, 150L, 73L,
109L, 50L, 228L, 29L, 57L, 59L, 14L, 45L, 93L), "1-16" = c(20L,
9L, 25L, 21L, 11L, 199L, 46L, 124L, 65L, 117L, 40L, 224L, 28L,
88L, 43L, 22L, 18L, 94L), "2-16" = c(54L, 11L, 29L, 20L, 11L,
164L, 44L, 117L, 70L, 85L, 46L, 192L, 23L, 89L, 50L, 27L, 29L,
86L), "3-16" = c(53L, 11L, 24L, 26L, 19L, 203L, 45L, 144L, 66L,
109L, 47L, 213L, 15L, 120L, 59L, 15L, 33L, 127L)), .Names = c("LSOA Name",
"3-15", "4-15", "5-15", "6-15", "7-15", "8-15", "9-15", "10-15",
"11-15", "12-15", "1-16", "2-16", "3-16"), row.names = c(NA,
-18L), class = "data.frame")
A typical way of plotting lines by groups in ggplot is to shift the data to long format, where one column identifies the group, and the other columns identify the x and y axis values.
This example shifts your data into long format with three columns: LSOAName, month_col, and values_col. It adds a day value onto the month-year, and converts that column to a date. Then it plots a line for each group.
I've renamed your dataframe d, because dataframe could be easily misinterpreted as the function data.frame().
# load libraries
library(magrittr)
library(dplyr)
library(tidyr)
library(ggplot2)
# rename dataframe so it doesn't look so much like the base function
d <- dataframe
# remove spaces in column names
names(d) <- gsub(" ", "", names(d))
# shift data from wide to long and then
# add a day value and convert day-month-year to date class
d %<>% gather(month_col, values_col, -LSOAName) %>%
  mutate(month_col = as.Date(paste0("1-", month_col), "%d-%m-%y"))
# plot using ggplot2
ggplot(d, aes(x = month_col, y = values_col, colour = LSOAName)) +
  geom_line()
Edit
%<>% is found in the magrittr package. It is a compound pipe assignment operator. While %>% returns the result of a pipeline, %<>% assigns the result back to the left side object.
Instead of writing
d <- d %>% [pipeline]
you can assign the results to d by writing
d %<>% [pipeline]
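For readers who want to see what the wide-to-long step produces without loading any packages, here is a base R sketch on the first two regions and the first two months of the data above. It is an alternative illustration of the reshaping, not the code used in the answer.

```r
# First two regions and two months from the question's data.
wide <- data.frame(
  LSOAName = c("Lancaster 001", "Lancaster 002"),
  `3-15` = c(49L, 16L),
  `4-15` = c(63L, 11L),
  check.names = FALSE   # keep "3-15" style column names as they are
)

month_cols <- setdiff(names(wide), "LSOAName")

# One row per region and month: region, month label, value.
long <- data.frame(
  LSOAName   = rep(wide$LSOAName, times = length(month_cols)),
  month_col  = rep(month_cols, each = nrow(wide)),
  values_col = unlist(wide[month_cols], use.names = FALSE)
)

# Prepend a day so the month-year label can be parsed as a Date.
long$month_col <- as.Date(paste0("1-", long$month_col), format = "%d-%m-%y")
```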

How to union tables in R?

I am trying to union two tables. Every month I get new data coming in, and it would be handy to append the new data to the existing data frame. I am not looking to merge them, as they have the same variables.
A small example follows: M and N have the same dimensions, and I would like to combine them.
Thanks in advance
M <- structure(list(ID= c(56L, 67L, 68L, 73L, 77L, 87L), Mary = c(73L,
82L, 80L, 78L, 79L, 80L), Dave = c(45L, 42L, 51L, 46L, 60L, 54L
), Anne = c(78L, 85L, 92L, 83L, 77L, 89L), Bob = c(51L, 49L,
58L, 54L, 62L, 68L)), .Names = c("ID", "Mary", "Dave", "Anne",
"Bob"), class = "data.frame", row.names = c(NA, -6L))
N <- structure(list(ID= c(53L, 22L, 21L, 73L, 727L, 27L), Mary = c(72L,
82L, 80L, 78L, 79L, 80L), Dave = c(45L, 42L, 51L, 46L, 62L, 54L
), Anne = c(78L, 85L, 92L, 22L, 77L, 89L), Bob = c(52L, 49L,
58L, 54L, 62L, 628L)), .Names = c("ID", "Mary", "Dave", "Anne",
"Bob"), class = "data.frame", row.names = c(NA, -6L))
This might be all you need:
MN <- rbind(M, N)
If the two data.frames have different columns, then I would recommend this instead:
library(plyr)
MN <- rbind.fill(M, N)
Finally, if you need to remove duplicates:
MN <- MN[!duplicated(MN),]
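A small self-contained sketch of these steps on toy data (the data frames old, new and extra are invented for illustration). The last part shows one base R way to cope with a column that exists only in the incoming data, as a rough stand-in for rbind.fill; it only handles columns missing from the existing data frame.

```r
old <- data.frame(ID = c(56L, 67L), Mary = c(73L, 82L))
new <- data.frame(ID = c(67L, 68L), Mary = c(82L, 80L))

# Stack the rows, then drop exact duplicate rows.
combined <- rbind(old, new)
deduped  <- combined[!duplicated(combined), ]

# Incoming data with an extra column: add it to the existing data as NA
# so that rbind() sees the same column names on both sides.
extra <- data.frame(ID = 99L, Mary = 70L, Zoe = 88L)
missing_cols <- setdiff(names(extra), names(deduped))
deduped[missing_cols] <- NA
all_rows <- rbind(deduped, extra)
```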
