Here is a test table:
df <- read.table(text="
str1 str2 name t y x
a yes bas 23 323 21
b no aasd 23 54 33
a no asd 2 43 23
b yes hggf 43 123 55
b no jgd 1 12 11
b yes qw 32 12 12
a yes rrrr 45 22 32
a no ggg 121 11 43
",
header = TRUE)
With help here we can get such subtotals
library(janitor)
library(purrr)
library(dplyr)
df<-df %>%
split(.[,"str1"]) %>% ## splits each change in cyl into a list of dataframes
map_df(., janitor::adorn_totals)
But my question is how to get also sub totals inside each group of column str1 depending on group inside of str2. It's needed a dataframe like this:
Would appreciate any help
P.S it is vital x column to be in descending order in each group
We can do the split by two columns and then change the name of the 'Total' based on the values in 'str1', 'str2'
library(dplyr)
library(janitor)
library(purrr)
library(stringr)
df %>%
group_split(str1, str2) %>%
map_dfr(~ .x %>%
janitor::adorn_totals(.) %>%
mutate(str1 = replace(str1, n(), str_c(str1[n()], "_",
first(str1), "_", first(str2)))))
Alternatively, using the same syntax than for your first split, you can do:
library(janitor)
library(purrr)
library(dplyr)
df %>% arrange(x) %>%
split(.[,c("str2","str1")]) %>%
map_df(., janitor::adorn_totals)
str1 str2 name t y x
a no asd 2 43 23
a no ggg 121 11 43
Total - - 123 54 66
a yes bas 23 323 21
a yes rrrr 45 22 32
Total - - 68 345 53
b no jgd 1 12 11
b no aasd 23 54 33
Total - - 24 66 44
b yes qw 32 12 12
b yes hggf 43 123 55
Total - - 75 135 67
If you don't mind the location of the "total" rows being a little different, you can use data.table::rollup. Rows with NA are totals for the group identified by the values of the non-NA columns.
library(data.table)
setDT(df)
group_vars <- head(names(df), 3)
df_ru <-
rollup(df, j = lapply(.SD, sum), by = group_vars,
.SDcols = tail(names(df), 3))
setorderv(df_ru, group_vars)[-1]
#> str1 str2 name t y x
#> 1: a <NA> <NA> 191 399 119
#> 2: a no <NA> 123 54 66
#> 3: a no asd 2 43 23
#> 4: a no ggg 121 11 43
#> 5: a yes <NA> 68 345 53
#> 6: a yes bas 23 323 21
#> 7: a yes rrrr 45 22 32
#> 8: b <NA> <NA> 99 201 111
#> 9: b no <NA> 24 66 44
#> 10: b no aasd 23 54 33
#> 11: b no jgd 1 12 11
#> 12: b yes <NA> 75 135 67
#> 13: b yes hggf 43 123 55
#> 14: b yes qw 32 12 12
Created on 2021-06-05 by the reprex package (v2.0.0)
Related
I have two dataframe(df1, df2) with different size .
The df2 is subset of df1.
Date columns are formatted as date (y.m.d).
(Location and location_id have a one to one mapping.)
tl;dr my goal is to use df2’s date to filter any date that comes after that for its specific location.
df1
date
location
location_id
y
x
22.02.02
A
1
26
78
22.02.02
B
2
45
67
22.02.02
C
3
46
57
22.02.02
D
4
27
89
22.02.01
A
1
37
67
22.02.01
B
2
82
23
22.02.01
C
3
56
68
22.02.01
D
4
67
90
22.01.31
A
1
61
37
22.01.31
B
2
90
65
22.01.31
C
3
43
34
22.01.31
D
4
12
23
22.01.30
A
1
38
48
22.01.30
B
2
57
53
22.01.30
C
3
75
95
22.01.30
D
4
76
19
df2
date
location
location_id
y
x
22.02.01
A
1
37
67
22.02.02
B
2
45
67
22.01.30
C
3
75
95
22.01.31
D
4
12
23
In df2 each location corresponds to one and only one date. I want to use this location to date map from df2, and apply a function to df1. This function should set x and y columns to zero (or na) for any date that comes after specified date for that location on df2.
So the output df3 would be like this:
df3
date
location
location_id
y
x
22.02.02
A
1
22.02.02
B
2
45
67
22.02.02
C
3
22.02.02
D
4
22.02.01
A
1
37
67
22.02.01
B
2
82
23
22.02.01
C
3
22.02.01
D
4
22.01.31
A
1
61
37
22.01.31
B
2
90
65
22.01.31
C
3
22.01.31
D
4
12
23
22.01.30
A
1
38
48
22.01.30
B
2
57
53
22.01.30
C
3
75
95
22.01.30
D
4
76
19
I do not know how to solve this kind of questions, what is the most efficient way to solve this, is it using dplyr or for loop?!
I am approaching this with using dplyr ,groupby and mutate. Something Like this (I do not get result that I want).
Here my code which does not give me the result:
df1 %>%
group_by(location, id_location) %>%
mutate(date = df2$date, y= df2$y, x = df2$x)
I am new to R and using dplyr.
Here's one approach. First, I make a version of df2 with the dates stored as dates, which'll make it simpler to use them for calculations, and call that date_limits. (It's not strictly necessary here since your date strings' alphabetical sorting will also be chronological, but I think it's good practice.) I don't need the x/y values since they're in df1 already.
library(tidyverse); library(lubridate)
date_limits <- df2 %>%
mutate(max_date = ymd(date)) %>%
select(max_date, location, location_id)
Then we can join those dates onto df1 using dplyr::left_join, sort of like vlookup in excel, or merge in base R. It will by default use all the common variables (in this case location and location_id) to bring in the max_date for that location.
Then I change y and x using mutate(across(... so that if the max_date we pulled in is later than the date, change it to NA, otherwise leave it as is.
df1 %>%
mutate(date = ymd(date)) %>%
left_join(date_limits) %>%
mutate(across(y:x, ~if_else(date > max_date, NA_integer_, .)))
Result
Joining, by = c("location", "location_id")
date location location_id y x max_date
1 2022-02-02 A 1 NA NA 2022-02-01
2 2022-02-02 B 2 45 67 2022-02-02
3 2022-02-02 C 3 NA NA 2022-01-30
4 2022-02-02 D 4 NA NA 2022-01-31
5 2022-02-01 A 1 37 67 2022-02-01
6 2022-02-01 B 2 82 23 2022-02-02
7 2022-02-01 C 3 NA NA 2022-01-30
8 2022-02-01 D 4 NA NA 2022-01-31
9 2022-01-31 A 1 61 37 2022-02-01
10 2022-01-31 B 2 90 65 2022-02-02
11 2022-01-31 C 3 NA NA 2022-01-30
12 2022-01-31 D 4 12 23 2022-01-31
13 2022-01-30 A 1 38 48 2022-02-01
14 2022-01-30 B 2 57 53 2022-02-02
15 2022-01-30 C 3 75 95 2022-01-30
16 2022-01-30 D 4 76 19 2022-01-31
I have a dataset with two columns, "ID" and "CODCOM" with about 1 milion of rows. The first column "ID" contain duplicate values.
ID
CODCOM
10000
12
101010
14
201020
11
201020
11
201020
12
324032
43
324032
43
324032
43
405044
51
323032
21
I want to group "ID" duplicated values in different groups, then calculate the mode for each groups, after that I want to create a new column with the related mode values. Something like this:
ID
CODCOM
NEW_COL
10000
12
12
101010
14
14
201020
11
11
201020
11
11
201020
12
11
324032
43
43
324032
43
43
324032
43
43
405044
51
51
323032
21
43
How can I do this in an easy way?
Thank you so much in advance for any help provided.
A dplyr approach where I join the data to a version of itself with just the most-common CODCOM value (or first appearing with ties).
library(dplyr)
df1 %>%
left_join(
df1 %>%
group_by(ID) %>%
count(mode = CODCOM, sort = TRUE) %>%
slice(1),
by = "ID"
)
ID CODCOM mode n
1 10000 12 12 1
2 101010 14 14 1
3 201020 11 11 2
4 201020 11 11 2
5 201020 12 11 2
6 324032 43 43 3
7 324032 43 43 3
8 324032 43 43 3
9 405044 51 51 1
10 323032 21 21 1
Please find below one solution using the package data.table:
REPREX
Code
library(data.table)
# Function to compute mode
mode_compute <- function(x) {
uniqx <- unique(x)
uniqx[which.max(tabulate(match(x, uniqx)))]
}
# Compute mode by ID
DT[ , MODE := mode_compute(CODCOM), by = ID]
Output
DT
#> ID CODCOM MODE
#> 1: 10000 12 12
#> 2: 101010 14 14
#> 3: 201020 11 11
#> 4: 201020 11 11
#> 5: 201020 12 11
#> 6: 324032 43 43
#> 7: 324032 43 43
#> 8: 324032 43 43
#> 9: 405044 51 51
#> 10: 323032 21 21
Data:
# Data
DT <- data.table(ID = c("10000", "101010", "201020", "201020", "201020",
"324032", "324032", "324032", "405044", "323032"),
CODCOM = c(12, 14, 11, 11, 12, 43, 43, 43, 51, 21))
DT
#> ID CODCOM
#> 1: 10000 12
#> 2: 101010 14
#> 3: 201020 11
#> 4: 201020 11
#> 5: 201020 12
#> 6: 324032 43
#> 7: 324032 43
#> 8: 324032 43
#> 9: 405044 51
#> 10: 323032 21
Created on 2021-10-11 by the reprex package (v0.3.0)
If I understand you correctly: we could group_by ID and then use summarise the mode of the mode function:
If you don't want to summarise you could use mutate instead (will keep all rows)!
library(dplyr)
mode <- function(codes){
which.max(tabulate(codes))
}
df %>%
as_tibble() %>%
group_by(ID) %>%
summarise(NEW_COL = mode(CODCOM))
ID NEW_COL
<int> <int>
1 10000 12
2 101010 14
3 201020 11
4 323032 21
5 324032 43
6 405044 51
Base R solution:
# Option 1 using TarJae's mode function:
# Apply the function groupwise, store result as vector:
# NEW_COL => integer vector
df$NEW_COL <- with(
df,
ave(
CODCOM,
ID,
FUN = function(x){
which.max(tabulate(x))
}
)
)
# Option two:
# Function to calculate the mode of a vector:
# mode_statistic => function()
mode_statistic <- function(x){
# Calculate the mode: res => vector
res <- names(
head(
sort(
table(
x
),
decreasing = TRUE
),
1
)
)
# Explicitly define returned object: character vector => env
return(res)
}
# Apply the function groupwise, store result as vector:
# NEW_COL => integer vector
df$NEW_COL <- with(
df,
ave(
CODCOM,
ID,
FUN = function(x){
as.integer(
mode_statistic(x)
)
}
)
)
I want to calculate the mean in a row if at least three out of six observations in the row are != NA. If four or more NA´s are present, the mean should show NA.
Example which gives me the mean, ignoring the NA´s:
require(dplyr)
a <- 1:10
b <- a+10
c <- a+20
d <- a+30
e <- a+40
f <- a+50
df <- data.frame(a,b,c,d,e,f)
df[2,c(1,3,4,6)] <- NA
df[5,c(1,4,6)] <- NA
df[8,c(1,2,5,6)] <- NA
df <- df %>% mutate(mean = rowMeans(df[,1:6], na.rm=TRUE))
I thought about the use of
case_when
but i´m not sure how to use it correctly:
df <- df %>% mutate(mean = case_when( ~ rowMeans(df[,1:6], na.rm=TRUE), TRUE ~ NA))
You can try a base R solution saving the number of non NA values in a new variable and then use ifelse() for the mean:
#Data
a <- 1:10
b <- a+10
c <- a+20
d <- a+30
e <- a+40
f <- a+50
df <- data.frame(a,b,c,d,e,f)
df[2,c(1,3,4,6)] <- NA
df[5,c(1,4,6)] <- NA
df[8,c(1,2,5,6)] <- NA
#Code
#Count number of non NA
df$count <- rowSums( !is.na( df [,1:6]))
#Compute mean
df$Mean <- ifelse(df$count>=3,rowMeans(df [,1:6],na.rm=T),NA)
Output:
a b c d e f count Mean
1 1 11 21 31 41 51 6 26.00000
2 NA 12 NA NA 42 NA 2 NA
3 3 13 23 33 43 53 6 28.00000
4 4 14 24 34 44 54 6 29.00000
5 NA 15 25 NA 45 NA 3 28.33333
6 6 16 26 36 46 56 6 31.00000
7 7 17 27 37 47 57 6 32.00000
8 NA NA 28 38 NA NA 2 NA
9 9 19 29 39 49 59 6 34.00000
10 10 20 30 40 50 60 6 35.00000
You could do:
library(dplyr)
df %>%
rowwise %>%
mutate(
mean = case_when(
sum(is.na(c_across())) < 4 ~ mean(c_across(), na.rm = TRUE),
TRUE ~ NA_real_)
) %>% ungroup()
Output:
# A tibble: 10 x 7
a b c d e f mean
<int> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl>
1 1 11 21 31 41 51 26
2 NA 12 NA NA 42 NA NA
3 3 13 23 33 43 53 28
4 4 14 24 34 44 54 29
5 NA 15 25 NA 45 NA 28.3
6 6 16 26 36 46 56 31
7 7 17 27 37 47 57 32
8 NA NA 28 38 NA NA NA
9 9 19 29 39 49 59 34
10 10 20 30 40 50 60 35
This is leveraging rowwise and c_across which basically means operating on row level, so you can use vectorized functions such as sum, mean etc. in their usual way (also with case_when).
c_across also has a cols argument where you can specify which columns you want to take into account. For example, if you'd like to take into account columns 1:6, you can specify this as:
df %>%
rowwise %>%
mutate(
mean = case_when(
sum(is.na(c_across(1:6))) < 4 ~ mean(c_across(), na.rm = TRUE),
TRUE ~ NA_real_)
) %>% ungroup()
Alternatively, if you'd e.g. like to take into account all columns except column number 2, you would do c_across(-2). You can also use column names, e.g. for the first example c_across(a:f) (all columns) or for the second c_across(-b) (all columns except b).
This is implemented internally in dplyr, but you could also do usual vector subsetting with taking the whole c_across() (which defaults to all columns, i.e. everything()) and do e.g. c_across()[1:6] or c_across()[-2].
We can create an index first and then do the assignment based on the index
i1 <- rowSums(!is.na(df)) >=3
df$Mean[i1] <- rowMeans(df[i1,], na.rm = TRUE)
df
# a b c d e f Mean
#1 1 11 21 31 41 51 26.00000
#2 NA 12 NA NA 42 NA NA
#3 3 13 23 33 43 53 28.00000
#4 4 14 24 34 44 54 29.00000
#5 NA 15 25 NA 45 NA 28.33333
#6 6 16 26 36 46 56 31.00000
#7 7 17 27 37 47 57 32.00000
#8 NA NA 28 38 NA NA NA
#9 9 19 29 39 49 59 34.00000
#10 10 20 30 40 50 60 35.00000
I have a data frame like this in wide format
setseed(1)
df = data.frame(item=letters[1:6], field1a=sample(6,6),field1b=sample(60,6),
field1c=sample(200,6),field2a=sample(6,6),field2b=sample(60,6),
field2c=sample(200,6))
what would be the best way to stack all a columns together and all b together and all c together like this
items fielda fieldb fieldc
a 2 52 121
a 1 44 57
using base R:
cbind(item=df$item,unstack(transform(stack(df,-1),ind=sub("\\d+","",ind))))
item fielda fieldb fieldc
1 a 2 57 138
2 b 6 39 77
3 c 3 37 153
4 d 4 4 99
5 e 1 12 141
6 f 5 10 194
7 a 3 17 97
8 b 4 23 120
9 c 5 1 98
10 d 1 22 37
11 e 2 49 163
12 f 6 19 131
Or you can use the reshape function in Base R:
reshape(df,varying = split(names(df)[-1],rep(1:3,2)),idvar = "item",direction = "long")
item time field1a field1b field1c
a.1 a 1 2 57 138
b.1 b 1 6 39 77
c.1 c 1 3 37 153
d.1 d 1 4 4 99
e.1 e 1 1 12 141
f.1 f 1 5 10 194
a.2 a 2 3 17 97
b.2 b 2 4 23 120
c.2 c 2 5 1 98
d.2 d 2 1 22 37
e.2 e 2 2 49 163
f.2 f 2 6 19 131
You can also decide to separate the name of the dataframe by yourself then format it:
names(df)=sub("(\\d)(.)","\\2.\\1",names(df))
reshape(df,varying= -1,idvar = "item",direction = "long")
If we are using tidyverse, then gather into 'long' format, do some rearrangements with the column name and spread
library(tidyverse)
out <- df %>%
gather(key, val, -item) %>%
mutate(key1 = gsub("\\d+", "", key),
key2 = gsub("\\D+", "", key)) %>%
select(-key) %>%
spread(key1, val) %>%
select(-key2)
head(out, 2)
# item fielda fieldb fieldc
#1 a 2 57 138
#2 a 3 17 97
Or a similar option is melt/dcast from data.table, where we melt into 'long' format, substring the 'variable' and then dcast to 'wide' format
library(data.table)
dcast(melt(setDT(df), id.var = "item")[, variable := sub("\\d+", "", variable)
], item + rowid(variable) ~ variable, value.var = 'value')[
, variable := NULL][]
# item fielda fieldb fieldc
# 1: a 2 57 138
# 2: a 3 17 97
# 3: b 6 39 77
# 4: b 4 23 120
# 5: c 3 37 153
# 6: c 5 1 98
# 7: d 4 4 99
# 8: d 1 22 37
# 9: e 1 12 141
#10: e 2 49 163
#11: f 5 10 194
#12: f 6 19 131
NOTE: Should also work when the lengths are not balanced for each cases
data
set.seed(1)
df = data.frame(item = letters[1:6],
field1a=sample(6,6),
field1b=sample(60,6),
field1c=sample(200,6),
field2a=sample(6,6),
field2b=sample(60,6),
field2c=sample(200,6))
I normally find an answer in previous questions posted here, but I can't seem to find this one, so here is my maiden question:
I have a dataframe with one column with repetitive values, I would like to split the other columns and have only 1 value in the first column and more columns than in the original dataframe.
Example:
df <- data.frame(test = c(rep(1:5,3)), time = sample(1:100,15), score = sample(1:500,15))
The original dataframe has 3 columns and 15 rows.
And it would turn into a dataframe with 5 rows and the columns would be split into 7 columns: 'test', 'time1', 'time2', 'time3', 'score1', score2', 'score3'.
Does anyone have an idea how this could be done?
I think using dcast with rowid from the data.table-package is well suited for this task:
library(data.table)
dcast(setDT(df), test ~ rowid(test), value.var = c('time','score'), sep = '')
The result:
test time1 time2 time3 score1 score2 score3
1: 1 52 3 29 21 131 45
2: 2 79 44 6 119 1 186
3: 3 67 95 39 18 459 121
4: 4 83 50 40 493 466 497
5: 5 46 14 4 465 9 24
Please try this:
df <- data.frame(test = c(rep(1:5,3)), time = sample(1:100,15), score = sample(1:500,15))
df$class <- c(rep('a', 5), rep('b', 5), rep('c', 5))
df <- split(x = df, f = df$class)
binded <- cbind(df[[1]], df[[2]], df[[3]])
binded <- binded[,-c(5,9)]
> binded
test time score class time.1 score.1 class.1 time.2 score.2 class.2
1 1 40 404 a 57 409 b 70 32 c
2 2 5 119 a 32 336 b 93 177 c
3 3 20 345 a 44 91 b 100 42 c
4 4 47 468 a 60 265 b 24 478 c
5 5 16 52 a 38 219 b 3 92 c
Let me know if it works for you!