How to see the distribution of a column in R?

I have a dataset that looks like this:
A tibble: 935 x 17
wage hours iq kww educ exper tenure age married black south urban sibs brthord meduc
<int> <int> <int> <int> <int> <int> <int> <int> <fctr> <fctr> <fctr> <fctr> <int> <int> <int>
1 769 40 93 35 12 11 2 31 1 0 0 1 1 2 8
2 808 50 119 41 18 11 16 37 1 0 0 1 1 NA 14
3 825 40 108 46 14 11 9 33 1 0 0 1 1 2 14
4 650 40 96 32 12 13 7 32 1 0 0 1 4 3 12
5 562 40 74 27 11 14 5 34 1 0 0 1 10 6 6
6 1400 40 116 43 16 14 2 35 1 1 0 1 1 2 8
7 600 40 91 24 10 13 0 30 0 0 0 1 1 2 8
8 1081 40 114 50 18 8 14 38 1 0 0 1 2 3 8
9 1154 45 111 37 15 13 1 36 1 0 0 0 2 3 14
10 1000 40 95 44 12 16 16 36 1 0 0 1 1 1 12
...
What can I run to see the distribution of wage (the first column)? Specifically, I want to see how many people have a wage of under $300.
Which ggplot function can I use?

You can get the cumulative histogram:
library(ggplot2)
ggplot(df,aes(wage))+geom_histogram(aes(y=cumsum(..count..)))+
stat_bin(aes(y=cumsum(..count..)),geom="line",color="green")
If you specifically want to know the count of entries meeting a certain condition, you can use count() from dplyr (in base R, sum(df$wage > 1000) gives the same number):
library(dplyr)
count(df[df$wage > 1000,])
## # A tibble: 1 x 1
## n
## <int>
## 1 3
Data:
df <- structure(list(wage = c(769L, 808L, 825L, 650L, 562L, 1400L,
600L, 1081L, 1154L, 1000L), hours = c(40L, 50L, 40L, 40L, 40L,
40L, 40L, 40L, 45L, 40L), iq = c(93L, 119L, 108L, 96L, 74L, 116L,
91L, 114L, 111L, 95L), kww = c(35L, 41L, 46L, 32L, 27L, 43L,
24L, 50L, 37L, 44L), educ = c(12L, 18L, 14L, 12L, 11L, 16L, 10L,
18L, 15L, 12L), exper = c(11L, 11L, 11L, 13L, 14L, 14L, 13L,
8L, 13L, 16L), tenure = c(2L, 16L, 9L, 7L, 5L, 2L, 0L, 14L, 1L,
16L), age = c(31L, 37L, 33L, 32L, 34L, 35L, 30L, 38L, 36L, 36L
), married = c(1L, 1L, 1L, 1L, 1L, 1L, 0L, 1L, 1L, 1L), black = c(0L,
0L, 0L, 0L, 0L, 1L, 0L, 0L, 0L, 0L), south = c(0L, 0L, 0L, 0L,
0L, 0L, 0L, 0L, 0L, 0L), urban = c(1L, 1L, 1L, 1L, 1L, 1L, 1L,
1L, 0L, 1L), sibs = c(1L, 1L, 1L, 4L, 10L, 1L, 1L, 2L, 2L, 1L
), brthord = c(2L, NA, 2L, 3L, 6L, 2L, 2L, 3L, 3L, 1L), meduc = c(8L,
14L, 14L, 12L, 6L, 8L, 8L, 8L, 14L, 12L)), .Names = c("wage",
"hours", "iq", "kww", "educ", "exper", "tenure", "age", "married",
"black", "south", "urban", "sibs", "brthord", "meduc"), row.names = c(NA,
-10L), class = c("tbl_df", "tbl", "data.frame"))

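Try this, which keeps only the wages under 300, counts them, and plots them:
library(dplyr)
library(ggplot2)
# Store the filtered rows in a new object so that df is not overwritten
low_wage <- df %>% filter(wage < 300)
nrow(low_wage)  # number of people with a wage under 300
ggplot(low_wage, aes(wage)) + geom_histogram()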
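For the specific question of how many people earn under $300, a plot is not strictly needed. Here is a minimal base R sketch that uses only the ten sample wages from the data above; the helper name count_below is made up for illustration:

```r
# Wages of the ten sample rows shown in the question
wage <- c(769L, 808L, 825L, 650L, 562L, 1400L, 600L, 1081L, 1154L, 1000L)

# Hypothetical helper: number of non-missing values strictly below a threshold
count_below <- function(x, threshold) {
  sum(x < threshold, na.rm = TRUE)
}

n_under_300 <- count_below(wage, 300)               # count of wages under 300
share_under_300 <- mean(wage < 300, na.rm = TRUE)   # the same as a proportion

# Base R histogram of the whole column, with the cutoff marked
hist(wage, main = "Distribution of wage", xlab = "wage")
abline(v = 300, lty = 2)
```

On the full 935-row tibble you would pass df$wage instead of the short wage vector.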

Related

How to create a cross tabulation table between two variables with the counts in R?

I have a data frame with two columns that I want to cross tabulate. The data also includes the count for each combination. I am trying to create the cross table and include those counts in it, but I am struggling to get the counts from the data frame into the cross table.
> df %>% arrange(d1)%>% head()
count d1 d2
1 3 1 15
2 86 1 14
3 13 1 12
4 186 1 16
5 29 1 9
6 86 1 13
> table(df$d1,df$d2)
1 2 3 4 5 6 7 8 9 10 11 12 13 14 15 16 17
1 0 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1
2 1 0 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1
3 1 1 0 1 1 1 1 1 1 1 1 1 1 1 1 1 1
I expect [1,15] and [1,14] to show 3 and 86, based on the counts in df.
Right now it shows only 0s and 1s, depending on whether the combination exists.
Here is my sample data:
structure(list(count = c(37L, 6L, 44L, 21L, 8L, 3L, 9L, 17L,
13L, 32L, 106L, 34L, 505L, 173L, 12L, 2L, 4L, 45L, 3L, 43L, 5L,
16L, 1L, 27L, 17L, 3L, 4L, 1L, 27L, 86L, 79L, 10L, 161L, 32L,
3L, 209L, 9L, 83L, 23L, 108L, 161L, 22L, 4L, 16L, 2L, 6L, 67L,
86L, 3L, 1L, 14L, 14L, 111L, 5L, 5L, 44L, 105L, 13L, 269L, 186L,
3L, 5L, 5L, 27L, 3L, 186L, 58L, 29L, 34L, 43L, 8L, 92L, 9L, 455L,
22L, 32L, 4L, 14L, 58L, 22L, 190L, 94L, 27L, 152L, 264L, 36L,
1L, 505L, 86L, 44L, 3L, 1L, 79L, 75L, 12L, 32L, 11L, 197L, 90L,
269L, 9L, 6L, 47L, 14L, 158L, 303L, 335L, 37L, 33L, 3L, 83L,
15L, 31L, 124L, 146L, 26L, 36L, 27L, 37L, 31L, 108L, 121L, 111L,
11L, 5L, 26L, 166L, 11L, 18L, 11L, 8L, 15L, 18L, 165L, 80L, 14L,
5L, 3L, 492L, 7L, 90L, 146L, 130L, 197L, 165L, 34L, 22L, 122L,
29L, 74L, 455L, 303L, 45L, 5L, 173L, 33L, 24L, 229L, 79L, 43L,
68L, 16L, 10L, 73L, 35L, 99L, 229L, 94L, 23L, 492L, 18L, 84L,
92L, 86L, 35L, 31L, 1L, 23L, 8L, 121L, 1L, 173L, 400L, 124L,
20L, 11L, 6L, 3L, 166L, 84L, 31L, 122L, 15L, 24L, 70L, 43L, 74L,
209L, 45L, 158L, 44L, 15L, 37L, 35L, 27L, 68L, 20L, 15L, 11L,
21L, 4L, 18L, 44L, 234L, 80L, 10L, 44L, 4L, 47L, 7L, 67L, 10L,
3L, 173L, 99L, 79L, 130L, 3L, 75L, 1L, 335L, 14L, 106L, 15L,
34L, 190L, 152L, 16L, 73L, 45L, 1L, 3L, 264L, 160L, 23L, 1L,
160L, 400L, 105L, 234L, 70L, 35L), d1 = c(10L, 17L, 5L, 3L, 12L,
1L, 10L, 10L, 12L, 7L, 14L, 6L, 16L, 3L, 7L, 9L, 7L, 13L, 4L,
8L, 9L, 2L, 7L, 16L, 8L, 15L, 12L, 12L, 2L, 1L, 16L, 15L, 14L,
5L, 8L, 14L, 11L, 11L, 4L, 4L, 13L, 7L, 12L, 11L, 17L, 8L, 4L,
13L, 15L, 15L, 12L, 13L, 4L, 5L, 5L, 5L, 2L, 1L, 2L, 1L, 2L,
13L, 12L, 5L, 3L, 16L, 10L, 1L, 14L, 2L, 7L, 9L, 15L, 16L, 3L,
11L, 8L, 12L, 9L, 9L, 14L, 11L, 8L, 11L, 16L, 10L, 17L, 6L, 1L,
3L, 5L, 1L, 3L, 11L, 10L, 14L, 5L, 3L, 6L, 16L, 15L, 15L, 4L,
14L, 14L, 16L, 16L, 8L, 3L, 7L, 1L, 15L, 6L, 11L, 6L, 5L, 1L,
15L, 2L, 7L, 14L, 2L, 13L, 10L, 6L, 1L, 3L, 15L, 2L, 3L, 9L,
7L, 11L, 3L, 10L, 16L, 17L, 7L, 3L, 15L, 1L, 2L, 10L, 13L, 4L,
5L, 8L, 4L, 9L, 16L, 13L, 4L, 10L, 17L, 6L, 8L, 7L, 11L, 8L,
9L, 16L, 7L, 14L, 9L, 4L, 3L, 13L, 4L, 8L, 16L, 8L, 6L, 14L,
14L, 9L, 13L, 17L, 12L, 10L, 1L, 17L, 11L, 16L, 2L, 1L, 7L, 14L,
12L, 2L, 9L, 8L, 6L, 4L, 13L, 9L, 6L, 5L, 6L, 12L, 11L, 4L, 2L,
14L, 12L, 11L, 7L, 8L, 6L, 1L, 12L, 9L, 12L, 5L, 3L, 6L, 15L,
13L, 8L, 10L, 4L, 1L, 13L, 17L, 13L, 1L, 10L, 14L, 17L, 9L, 2L,
10L, 17L, 2L, 12L, 5L, 3L, 6L, 7L, 3L, 16L, 15L, 5L, 9L, 2L,
6L, 5L, 13L, 11L, 4L, 6L, 13L, 4L), d2 = c(2L, 14L, 4L, 12L,
10L, 15L, 15L, 8L, 1L, 14L, 2L, 5L, 6L, 11L, 10L, 17L, 8L, 10L,
17L, 6L, 5L, 7L, 15L, 15L, 10L, 1L, 9L, 17L, 5L, 14L, 8L, 14L,
13L, 11L, 5L, 6L, 15L, 1L, 8L, 14L, 14L, 3L, 8L, 7L, 9L, 15L,
1L, 1L, 2L, 5L, 13L, 12L, 13L, 12L, 9L, 3L, 4L, 12L, 16L, 16L,
15L, 17L, 5L, 2L, 17L, 1L, 9L, 9L, 5L, 9L, 9L, 14L, 11L, 13L,
7L, 5L, 12L, 14L, 10L, 8L, 3L, 4L, 11L, 6L, 9L, 1L, 1L, 16L,
13L, 5L, 8L, 17L, 10L, 9L, 7L, 7L, 10L, 13L, 1L, 2L, 10L, 8L,
10L, 12L, 11L, 4L, 10L, 14L, 8L, 12L, 11L, 6L, 7L, 2L, 2L, 1L,
10L, 16L, 10L, 6L, 4L, 1L, 4L, 5L, 17L, 5L, 2L, 3L, 8L, 15L,
7L, 4L, 12L, 4L, 6L, 17L, 6L, 5L, 16L, 4L, 6L, 6L, 14L, 3L, 3L,
14L, 9L, 6L, 1L, 5L, 16L, 16L, 13L, 13L, 13L, 3L, 13L, 13L, 16L,
2L, 7L, 2L, 15L, 3L, 12L, 1L, 11L, 11L, 4L, 3L, 2L, 9L, 9L, 1L,
4L, 8L, 12L, 6L, 12L, 2L, 2L, 3L, 11L, 11L, 8L, 1L, 17L, 7L,
3L, 6L, 13L, 4L, 7L, 7L, 13L, 8L, 16L, 14L, 16L, 14L, 5L, 12L,
8L, 4L, 8L, 16L, 1L, 15L, 7L, 3L, 12L, 11L, 13L, 6L, 10L, 13L,
5L, 7L, 4L, 15L, 4L, 15L, 4L, 6L, 3L, 3L, 10L, 3L, 11L, 17L,
16L, 16L, 14L, 2L, 6L, 14L, 11L, 11L, 9L, 12L, 7L, 7L, 16L, 13L,
12L, 15L, 2L, 16L, 2L, 3L, 9L, 9L)), row.names = c(NA, 252L), class = "data.frame")
xtabs may be useful here
> xtabs(count ~ d1 + d2, df)
d2
d1 1 2 3 4 5 6 7 8 9 10 11 12 13 14 15 16 17
1 0 121 99 67 26 90 11 20 29 36 83 13 86 86 3 186 1
2 121 0 166 105 27 146 16 18 43 37 124 15 160 106 3 269 1
3 99 166 0 165 44 234 22 33 73 79 173 21 197 190 11 492 3
4 67 105 165 0 44 122 15 23 35 47 94 35 111 108 7 303 3
5 26 27 44 44 0 34 3 3 5 11 32 5 44 34 1 74 0
6 90 146 234 122 34 0 31 43 84 80 152 23 173 209 15 505 5
7 11 16 22 15 3 31 0 4 8 12 16 3 24 32 1 68 0
8 20 18 33 23 3 43 4 0 22 17 27 4 31 37 6 79 0
9 29 43 73 35 5 84 8 22 0 58 75 4 70 92 0 264 2
10 36 37 79 47 11 80 12 17 58 0 0 8 45 130 9 335 0
11 83 124 173 94 32 152 16 27 75 0 0 18 229 158 9 400 0
12 13 15 21 35 5 23 3 4 4 8 18 0 14 14 0 45 1
13 86 160 197 111 44 173 24 31 70 45 229 14 0 161 10 455 5
14 86 106 190 108 34 209 32 37 92 130 158 14 161 0 10 0 6
15 3 3 11 7 1 15 1 6 0 9 9 0 10 10 0 27 0
16 186 269 492 303 74 505 68 79 264 335 400 45 455 0 27 0 14
17 1 1 3 3 0 5 0 0 2 0 0 1 5 6 0 14 0
Convert to data.frame if required
as.data.frame.matrix(xtabs(count ~ d1 + d2, df))
1 2 3 4 5 6 7 8 9 10 11 12 13 14 15 16 17
1 0 121 99 67 26 90 11 20 29 36 83 13 86 86 3 186 1
2 121 0 166 105 27 146 16 18 43 37 124 15 160 106 3 269 1
3 99 166 0 165 44 234 22 33 73 79 173 21 197 190 11 492 3
4 67 105 165 0 44 122 15 23 35 47 94 35 111 108 7 303 3
5 26 27 44 44 0 34 3 3 5 11 32 5 44 34 1 74 0
6 90 146 234 122 34 0 31 43 84 80 152 23 173 209 15 505 5
7 11 16 22 15 3 31 0 4 8 12 16 3 24 32 1 68 0
8 20 18 33 23 3 43 4 0 22 17 27 4 31 37 6 79 0
9 29 43 73 35 5 84 8 22 0 58 75 4 70 92 0 264 2
10 36 37 79 47 11 80 12 17 58 0 0 8 45 130 9 335 0
11 83 124 173 94 32 152 16 27 75 0 0 18 229 158 9 400 0
12 13 15 21 35 5 23 3 4 4 8 18 0 14 14 0 45 1
13 86 160 197 111 44 173 24 31 70 45 229 14 0 161 10 455 5
14 86 106 190 108 34 209 32 37 92 130 158 14 161 0 10 0 6
15 3 3 11 7 1 15 1 6 0 9 9 0 10 10 0 27 0
16 186 269 492 303 74 505 68 79 264 335 400 45 455 0 27 0 14
17 1 1 3 3 0 5 0 0 2 0 0 1 5 6 0 14 0
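The reason xtabs helps here is that a formula with a left-hand side sums that variable within each combination of the right-hand-side variables, whereas table only counts rows. A small sketch with a made-up data frame (toy is hypothetical; only the cells [1,15] = 3 and [1,14] = 86 come from the question):

```r
# Hypothetical miniature of the question's data: one count per (d1, d2) row
toy <- data.frame(
  count = c(3L, 86L, 13L, 5L),
  d1    = c(1L, 1L, 2L, 2L),
  d2    = c(15L, 14L, 14L, 14L)
)

# table() ignores the count column and just counts how many rows share a pair
tab_presence <- table(toy$d1, toy$d2)

# xtabs() with count on the left-hand side sums count within each pair
tab_counts <- xtabs(count ~ d1 + d2, data = toy)

tab_presence
tab_counts
```

Note that the pair (2, 14) appears twice in toy, so xtabs adds its two counts together; if each pair occurs only once, as in the question, the cell is simply the stored count.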
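Or you may use dcast from data.table. Combinations that do not occur show up as NA unless you pass fill = 0:
library(data.table)
# dcast() from data.table works on a data.table, so convert the data frame first
dcast(as.data.table(df), d1 ~ d2, value.var = 'count')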
Key: <d1>
d1 1 2 3 4 5 6 7 8 9 10 11 12 13 14 15 16 17
<int> <int> <int> <int> <int> <int> <int> <int> <int> <int> <int> <int> <int> <int> <int> <int> <int> <int>
1: 1 NA 121 99 67 26 90 11 20 29 36 83 13 86 86 3 186 1
2: 2 121 NA 166 105 27 146 16 18 43 37 124 15 160 106 3 269 1
3: 3 99 166 NA 165 44 234 22 33 73 79 173 21 197 190 11 492 3
4: 4 67 105 165 NA 44 122 15 23 35 47 94 35 111 108 7 303 3
5: 5 26 27 44 44 NA 34 3 3 5 11 32 5 44 34 1 74 NA
6: 6 90 146 234 122 34 NA 31 43 84 80 152 23 173 209 15 505 5
7: 7 11 16 22 15 3 31 NA 4 8 12 16 3 24 32 1 68 NA
8: 8 20 18 33 23 3 43 4 NA 22 17 27 4 31 37 6 79 NA
9: 9 29 43 73 35 5 84 8 22 NA 58 75 4 70 92 NA 264 2
10: 10 36 37 79 47 11 80 12 17 58 NA NA 8 45 130 9 335 NA
11: 11 83 124 173 94 32 152 16 27 75 NA NA 18 229 158 9 400 NA
12: 12 13 15 21 35 5 23 3 4 4 8 18 NA 14 14 NA 45 1
13: 13 86 160 197 111 44 173 24 31 70 45 229 14 NA 161 10 455 5
14: 14 86 106 190 108 34 209 32 37 92 130 158 14 161 NA 10 NA 6
15: 15 3 3 11 7 1 15 1 6 NA 9 9 NA 10 10 NA 27 NA
16: 16 186 269 492 303 74 505 68 79 264 335 400 45 455 NA 27 NA 14
17: 17 1 1 3 3 NA 5 NA NA 2 NA NA 1 5 6 NA 14 NA
If you are looking for a more publishable solution, you might want to try crosstable::crosstable() as it would let you output a nice HTML table.
This would require a few parameters though, as it is not meant for crossing long vectors of numbers in the first place.
Here is the code:
library(dplyr)
library(crosstable)
ct = df %>%
crosstable(d1, by=d2, percent_pattern="{n}", unique_numeric=Inf)
as_flextable(ct)

How to use GGPLOT to make stacked line graphs with same x-axis from multiple df columns?

I have a data.frame with multiple columns recording values for months of the year.
zdate mwe1.x mwe2.x mwe3.x mwe4.x mwe5.x mwe6.x mwe7.x mwe1.y mwe2.y
1 Jan 2017 10 0 1 0 0 4 0 41 5
2 Feb 2017 7 0 0 0 0 0 0 76 33
3 Mar 2017 16 0 0 0 0 6 0 261 59
4 Apr 2017 40 4 0 0 1 0 0 546 80
5 May 2017 8 0 0 0 1 4 0 154 18
6 Jun 2017 7 0 0 0 2 1 0 74 4
7 Jul 2017 20 0 0 0 0 1 0 116 8
8 Aug 2017 25 6 1 0 3 6 1 243 54
9 Sep 2017 8 2 2 0 3 5 0 257 46
10 Oct 2017 2 0 0 0 0 0 0 74 7
11 Nov 2017 13 0 0 0 1 0 0 144 9
12 Dec 2017 6 0 3 0 2 1 0 164 20
mwe3.y mwe4.y mwe5.y mwe6.y mwe7.y
1 17 4 11 4 28
2 61 0 22 7 72
3 91 1 69 16 309
4 71 0 94 19 206
5 29 0 44 3 58
6 21 0 15 2 66
7 12 0 23 2 20
8 20 0 36 2 55
9 42 0 55 7 89
10 13 0 24 0 7
11 39 0 18 1 11
12 54 0 88 5 51
I would like to plot separate line charts for these columns, with the same x-axis and stacked on top of each other. I am trying to use 'ggplot2' and 'facet_wrap' to do this, but I can't work out exactly how to do it. I can get a single line plot:
plot <- ggplot(all, aes(x = all$date, y = all$mwe1.x)) +
geom_line()
But I want to have this stacked with the line plot for 'mwe1.y' directly below it. Can anybody help me out?
Maybe you are looking for this. The key is to reshape your data to long format in order to use the full potential of ggplot2. Here is the code using tidyverse functions:
library(tidyverse)
#Code
df %>% mutate(zdate=factor(zdate,levels = unique(zdate),ordered = T)) %>%
pivot_longer(-zdate) %>%
ggplot(aes(x=zdate,y=value,group=name,color=name))+
geom_line()
Some data used:
#Data
df <- structure(list(zdate = c("Jan-2017", "Feb-2017", "Mar-2017",
"Apr-2017", "May-2017", "Jun-2017", "Jul-2017", "Aug-2017", "Sep-2017",
"Oct-2017", "Nov-2017", "Dec-2017"), mwe1.x = c(10L, 7L, 16L,
40L, 8L, 7L, 20L, 25L, 8L, 2L, 13L, 6L), mwe2.x = c(0L, 0L, 0L,
4L, 0L, 0L, 0L, 6L, 2L, 0L, 0L, 0L), mwe3.x = c(1L, 0L, 0L, 0L,
0L, 0L, 0L, 1L, 2L, 0L, 0L, 3L), mwe4.x = c(0L, 0L, 0L, 0L, 0L,
0L, 0L, 0L, 0L, 0L, 0L, 0L), mwe5.x = c(0L, 0L, 0L, 1L, 1L, 2L,
0L, 3L, 3L, 0L, 1L, 2L), mwe6.x = c(4L, 0L, 6L, 0L, 4L, 1L, 1L,
6L, 5L, 0L, 0L, 1L), mwe7.x = c(0L, 0L, 0L, 0L, 0L, 0L, 0L, 1L,
0L, 0L, 0L, 0L), mwe1.y = c(41L, 76L, 261L, 546L, 154L, 74L,
116L, 243L, 257L, 74L, 144L, 164L), mwe2.y = c(5L, 33L, 59L,
80L, 18L, 4L, 8L, 54L, 46L, 7L, 9L, 20L), mwe3.y = c(17L, 61L,
91L, 71L, 29L, 21L, 12L, 20L, 42L, 13L, 39L, 54L), mwe4.y = c(4L,
0L, 1L, 0L, 0L, 0L, 0L, 0L, 0L, 0L, 0L, 0L), mwe5.y = c(11L,
22L, 69L, 94L, 44L, 15L, 23L, 36L, 55L, 24L, 18L, 88L), mwe6.y = c(4L,
7L, 16L, 19L, 3L, 2L, 2L, 2L, 7L, 0L, 1L, 5L), mwe7.y = c(28L,
72L, 309L, 206L, 58L, 66L, 20L, 55L, 89L, 7L, 11L, 51L)), class = "data.frame", row.names = c(NA,
-12L))
Or maybe this:
#Code 2
df %>% mutate(zdate=factor(zdate,levels = unique(zdate),ordered = T)) %>%
pivot_longer(-zdate) %>%
mutate(name=factor(name,levels = unique(name),ordered = T)) %>%
ggplot(aes(x=zdate,y=value,group=name,color=name))+
geom_line()+
facet_wrap(.~name,ncol = 7,scales = 'free_y')+
theme(legend.position = 'none',
axis.text.x = element_text(angle=90))
Update: The OP only wants specific variables, so we can use filter():
#Code 3
df %>% mutate(zdate=factor(zdate,levels = unique(zdate),ordered = T)) %>%
pivot_longer(-zdate) %>%
filter(name %in% c('mwe1.x','mwe1.y')) %>%
mutate(name=factor(name,levels = unique(name),ordered = T)) %>%
ggplot(aes(x=zdate,y=value,group=name,color=name))+
geom_line()+
facet_wrap(.~name,ncol = 7,scales = 'free_y')+
theme(legend.position = 'none',
axis.text.x = element_text(angle=90))
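Since the question asks for 'mwe1.y' directly below 'mwe1.x', it may help to note that facet_wrap lays panels out side by side when ncol is larger than the number of panels; ncol = 1 puts them in a single column. A rough sketch under that assumption, using the first four months of the two series and a base R reshape (the object names wide, long and p are made up here):

```r
# First four months of the two series from the question's data
wide <- data.frame(
  zdate  = c("Jan-2017", "Feb-2017", "Mar-2017", "Apr-2017"),
  mwe1.x = c(10L, 7L, 16L, 40L),
  mwe1.y = c(41L, 76L, 261L, 546L)
)

# Long format: one row per (month, series) pair; the factor keeps month order
long <- data.frame(
  zdate = factor(rep(wide$zdate, times = 2), levels = wide$zdate),
  name  = rep(c("mwe1.x", "mwe1.y"), each = nrow(wide)),
  value = c(wide$mwe1.x, wide$mwe1.y)
)

# Build the plot only if ggplot2 is installed
if (requireNamespace("ggplot2", quietly = TRUE)) {
  library(ggplot2)
  p <- ggplot(long, aes(x = zdate, y = value, group = name)) +
    geom_line() +
    # ncol = 1 stacks the panels vertically; they share the x-axis
    facet_wrap(~ name, ncol = 1, scales = "free_y")
}
```

Printing p draws the two panels one above the other. The scales = "free_y" setting lets each panel pick its own y range, which matters here because the two series differ by an order of magnitude.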

delete observations by days in R

My dataset has the following structure:
df=structure(list(Data = structure(c(12L, 13L, 14L, 15L, 16L, 17L,
18L, 1L, 2L, 3L, 4L, 5L, 6L, 7L, 8L, 9L, 10L, 11L), .Label = c("01.01.2018",
"02.01.2018", "03.01.2018", "04.01.2018", "05.01.2018", "06.01.2018",
"07.01.2018", "12.02.2018", "13.02.2018", "14.02.2018", "15.02.2018",
"25.12.2017", "26.12.2017", "27.12.2017", "28.12.2017", "29.12.2017",
"30.12.2017", "31.12.2017"), class = "factor"), sku = 1:18, metric = c(100L,
210L, 320L, 430L, 540L, 650L, 760L, 870L, 980L, 1090L, 1200L,
1310L, 1420L, 1530L, 1640L, 1750L, 1860L, 1970L), action = c(0L,
0L, 0L, 0L, 0L, 0L, 0L, 0L, 1L, 1L, 1L, 1L, 1L, 1L, 1L, 1L, 1L,
1L)), .Names = c("Data", "sku", "metric", "action"), class = "data.frame", row.names = c(NA,
-18L))
I need to delete observations that fall on certain dates.
The dataset also has an action variable, which takes only two values, 0 and 1.
Observations on these dates should be deleted only for the zero category of action.
These dates are given in a separate dataset.
datedata=structure(list(Data = structure(c(18L, 19L, 20L, 21L, 22L, 5L,
7L, 9L, 11L, 13L, 15L, 17L, 23L, 1L, 2L, 3L, 4L, 6L, 8L, 10L,
12L, 14L, 16L), .Label = c("01.05.2018", "02.05.2018", "03.05.2018",
"04.05.2018", "05.03.2018", "05.05.2018", "06.03.2018", "06.05.2018",
"07.03.2018", "07.05.2018", "08.03.2018", "08.05.2018", "09.03.2018",
"09.05.2018", "10.03.2018", "10.05.2018", "11.03.2018", "21.02.2018",
"22.02.2018", "23.02.2018", "24.02.2018", "25.02.2018", "30.04.2018"
), class = "factor")), .Names = "Data", class = "data.frame", row.names = c(NA,
-23L))
How can I do it?
A solution is to use dplyr::filter as:
library(dplyr)
library(lubridate)
df %>% mutate(Data = dmy(Data)) %>%
filter(action==1 | (action==0 & !(Data %in% dmy(datedata$Data))))
# Data sku metric action
# 1 2017-12-25 1 100 0
# 2 2017-12-26 2 210 0
# 3 2017-12-27 3 320 0
# 4 2017-12-28 4 430 0
# 5 2017-12-29 5 540 0
# 6 2017-12-30 6 650 0
# 7 2017-12-31 7 760 0
# 8 2018-01-01 8 870 0
# 9 2018-01-02 9 980 1
# 10 2018-01-03 10 1090 1
# 11 2018-01-04 11 1200 1
# 12 2018-01-05 12 1310 1
# 13 2018-01-06 13 1420 1
# 14 2018-01-07 14 1530 1
# 15 2018-02-12 15 1640 1
# 16 2018-02-13 16 1750 1
# 17 2018-02-14 17 1860 1
# 18 2018-02-15 18 1970 1
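In the sample data none of the dates in df appear in datedata, which is why all 18 rows survive. To see the rule actually drop something, here is a base R sketch on made-up rows (obs and drop_dates are hypothetical stand-ins for df and datedata$Data):

```r
# Hypothetical observations in the same dd.mm.yyyy format as the question
obs <- data.frame(
  Data   = c("25.12.2017", "21.02.2018", "21.02.2018", "15.02.2018"),
  action = c(0L, 0L, 1L, 1L)
)
# Hypothetical list of dates to remove
drop_dates <- c("21.02.2018", "22.02.2018")

# Parse both sides as Date so the comparison does not depend on factor levels
obs$Data   <- as.Date(obs$Data, format = "%d.%m.%Y")
drop_dates <- as.Date(drop_dates, format = "%d.%m.%Y")

# A row is dropped only if action is 0 AND its date is in the list
to_drop <- obs$action == 0 & obs$Data %in% drop_dates
kept <- obs[!to_drop, ]
kept
```

Only the second row is removed: it is the one row that has both action 0 and a listed date, while the row with the same date and action 1 stays.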
I guess this will work. First use match to see whether the date in df also appears in datedata, then drop the matching rows that have action == 0:
library(dplyr)
df <- df %>% mutate(Data.flag = match(Data, datedata$Data)) %>%
# keep a row unless its date is in datedata and action is 0
filter(is.na(Data.flag) | action != 0)

Remove rows where at least x percent of the values are zero in both groups

Given the following data frame, I would like to remove every row in which at least x percent (e.g. 50%) of the values are 0 in both groups.
For example, if a row has non-zero values in less than 50% of the columns of both groups (control and treatment), it will be removed.
If a row has at least 50% non-zero values in one group (control or treatment) and no values in the other group, it will be kept, since there is still one group with at least 50% non-zero values.
I hope that's clear.
treatment control control treatment control treatment
row1 0 21 21 21 45 34
row2 0 21 78 321 93 0
row3 34 32 98 87 34 0
row4 75 21 12 54 45 34
row5 46 21 13 45 0 0
row6 85 21 87 45 0 23
row7 24 84 0 0 45 5
row8 87 21 0 98 87 76
row9 43 2 0 45 12 9
row10 12 12 0 0 23 0
Here below the dataframe
df <- structure(list(structure(c(1L, 3L, 4L, 5L, 6L, 7L, 8L, 9L, 10L,
2L), .Label = c("row1", "row10", "row2", "row3", "row4", "row5",
"row6", "row7", "row8", "row9"), class = "factor"), treatment = c(0L,
0L, 34L, 75L, 46L, 85L, 24L, 87L, 43L, 12L), control = c(21L,
21L, 32L, 21L, 21L, 21L, 84L, 21L, 2L, 12L), control = c(21L,
78L, 98L, 12L, 13L, 87L, 0L, 0L, 0L, 0L), treatment = c(21L,
321L, 87L, 54L, 45L, 45L, 0L, 98L, 45L, 0L), control = c(45L,
93L, 34L, 45L, 0L, 0L, 45L, 87L, 12L, 23L), treatment = c(34L,
0L, 0L, 34L, 0L, 23L, 5L, 76L, 9L, 0L)), .Names = c("", "treatment",
"control", "control", "treatment", "control", "treatment"), class = "data.frame", row.names = c(NA,
-10L))
Based on what you want, if a row has 3 or more zeros (50% of its six values), you want to remove the row.
rownames(df) <- df[,1]
df <- df[,-1]
df <- df[apply(df, 1, FUN = function(x){sum(x == 0)}) < 3,]
Row 10 is removed.
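The code above counts zeros across the whole row. If the rule is instead read per group, as the question describes (keep a row when at least one group has at least 50% non-zero values), a possible base R sketch looks like this. The names vals, group and min_frac are made up, and rowX is an invented row added to show a removal:

```r
# Three rows from the question plus one invented row (rowX)
vals <- rbind(
  row1  = c(0, 21, 21, 21, 45, 34),
  row5  = c(46, 21, 13, 45, 0, 0),
  row10 = c(12, 12, 0, 0, 23, 0),
  rowX  = c(0, 0, 0, 5, 7, 0)
)
# Group label of each column, in the column order of the question
group <- c("treatment", "control", "control", "treatment", "control", "treatment")
min_frac <- 0.5  # required share of non-zero values in a group

# Share of non-zero values per row, computed separately for each group
frac_nonzero <- sapply(unique(group), function(g) {
  rowMeans(vals[, group == g, drop = FALSE] != 0)
})

# Keep a row if ANY group reaches the required share
keep <- apply(frac_nonzero >= min_frac, 1, any)
filtered <- vals[keep, , drop = FALSE]
filtered
```

Under this reading row10 is kept, because two of its three control values are non-zero, while rowX is dropped because both groups have only one non-zero value out of three. Keeping the group labels in a separate vector also sidesteps the duplicated column names in the original data frame.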

Sum correlated variables

I have a list of 200 variables and I want to sum those that are highly correlated.
Assuming this is my data
mydata <- structure(list(APPLE= c(1L, 2L, 5L, 4L, 366L, 65L, 43L, 456L, 876L, 78L, 687L, 378L, 378L, 34L, 53L, 43L),
PEAR= c(2L, 2L, 5L, 4L, 366L, 65L, 43L, 456L, 876L, 78L, 687L, 378L, 378L, 34L, 53L, 41L),
PLUM = c(10L, 20L, 10L, 20L, 10L, 20L, 1L, 0L, 1L, 2010L,20L, 10L, 10L, 10L, 10L, 10L),
BANANA= c(2L, 10L, 31L, 2L, 2L, 5L, 2L, 5L, 1L, 52L, 1L, 2L, 52L, 6L, 2L, 1L),
LEMON = c(4L, 10L, 31L, 2L, 2L, 5L, 2L, 5L, 1L, 52L, 1L, 2L, 52L, 6L, 2L, 3L)),
.Names = c("APPLE", "PEAR", "PLUM", "BANANA", "LEMON"),
class = "data.frame", row.names = c(NA,-16L))
I have found this code which I am not sure how to tweak in order to leverage it for my purpose
https://stackoverflow.com/a/39484353/4797853
var.corelation <- cor(as.matrix(mydata), method="pearson")
library(igraph)
# prevent duplicated pairs
var.corelation <- var.corelation*lower.tri(var.corelation)
check.corelation <- which(var.corelation>0.62, arr.ind=TRUE)
graph.cor <- graph.data.frame(check.corelation, directed = FALSE)
groups.cor <- split(unique(as.vector(check.corelation)), clusters(graph.cor)$membership)
lapply(groups.cor,FUN=function(list.cor){rownames(var.corelation)[list.cor]})
The output that I am looking for is 2 data frames as follows:
DF1
GROUP1 GROUP2
3 16
4 40
ETC..
The values are the sum of the values within a group
DF2
ORIGINAL_VAR GROUP
APPLE 1
PEAR 1
PLUM 2
BANANA 2
LEMON 2
Try this (assuming that you have only clustered into 2 groups):
DF1 <- cbind.data.frame(GROUP1=rowSums(mydata[,groups.cor[[1]]]),
GROUP2=rowSums(mydata[,groups.cor[[2]]]))
DF1
GROUP1 GROUP2
1 3 16
2 4 40
3 10 72
4 8 24
5 732 14
6 130 30
7 86 5
8 912 10
9 1752 3
10 156 2114
11 1374 22
12 756 14
13 756 114
14 68 22
15 106 14
16 84 14
DF2 <- NULL
for (i in 1:2) {
DF2 <- rbind(DF2,
cbind.data.frame(ORIGINAL_VAR=rownames(var.corelation)[groups.cor[[i]]],
GROUP=i))
}
DF2
ORIGINAL_VAR GROUP
1 PEAR 1
2 APPLE 1
3 BANANA 2
4 LEMON 2
5 PLUM 2
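With 200 variables there will likely be more than two groups, so hard-coding GROUP1 and GROUP2 does not scale. A sketch of one way to generalize, assuming the groups are available as a named list of column names (the list groups below is written by hand for illustration rather than computed from the correlations, and only the first three rows of mydata are used):

```r
# First three rows of the question's data
mydata <- data.frame(
  APPLE  = c(1L, 2L, 5L),
  PEAR   = c(2L, 2L, 5L),
  PLUM   = c(10L, 20L, 10L),
  BANANA = c(2L, 10L, 31L),
  LEMON  = c(4L, 10L, 31L)
)

# Hypothetical grouping: names are group ids, values are member columns
groups <- list(`1` = c("APPLE", "PEAR"), `2` = c("PLUM", "BANANA", "LEMON"))

# DF1: one column of row sums per group, however many groups there are
DF1 <- as.data.frame(lapply(groups, function(vars) {
  rowSums(mydata[, vars, drop = FALSE])
}))
names(DF1) <- paste0("GROUP", names(groups))

# DF2: lookup table from original variable to its group
DF2 <- data.frame(
  ORIGINAL_VAR = unlist(groups, use.names = FALSE),
  GROUP        = rep(names(groups), lengths(groups))
)

DF1
DF2
```

If groups.cor holds column indices rather than names, as in the code from the question, the same lapply call should work unchanged, since mydata[, vars] accepts either.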
