I have a dataset with two columns, "ID" and "CODCOM", with about 1 million rows. The first column, "ID", contains duplicate values.
ID       CODCOM
10000    12
101010   14
201020   11
201020   11
201020   12
324032   43
324032   43
324032   43
405044   51
323032   21
I want to group the duplicated "ID" values, calculate the mode of CODCOM for each group, and then create a new column holding the related mode values. Something like this:
ID       CODCOM   NEW_COL
10000    12       12
101010   14       14
201020   11       11
201020   11       11
201020   12       11
324032   43       43
324032   43       43
324032   43       43
405044   51       51
323032   21       21
How can I do this in an easy way?
Thank you so much in advance for any help provided.
A dplyr approach where I join the data to a version of itself containing just the most common CODCOM value per ID (or the first-appearing value in case of ties).
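For reference, df1 is assumed to hold the question's example data; a sketch of one way to construct it:
df1 <- data.frame(ID = c(10000, 101010, 201020, 201020, 201020,
                         324032, 324032, 324032, 405044, 323032),
                  CODCOM = c(12, 14, 11, 11, 12, 43, 43, 43, 51, 21))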
library(dplyr)
df1 %>%
  left_join(
    df1 %>%
      group_by(ID) %>%
      count(mode = CODCOM, sort = TRUE) %>%
      slice(1),
    by = "ID"
  )
ID CODCOM mode n
1 10000 12 12 1
2 101010 14 14 1
3 201020 11 11 2
4 201020 11 11 2
5 201020 12 11 2
6 324032 43 43 3
7 324032 43 43 3
8 324032 43 43 3
9 405044 51 51 1
10 323032 21 21 1
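If you want exactly the NEW_COL from the question without the helper n column, a small variation (an untested sketch) renames inside count() and drops n before joining:
df1 %>%
  left_join(
    df1 %>%
      group_by(ID) %>%
      count(NEW_COL = CODCOM, sort = TRUE) %>%
      slice(1) %>%
      select(ID, NEW_COL),
    by = "ID"
  )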
Please find below one solution using the package data.table:
REPREX
Code
library(data.table)
# Function to compute mode
mode_compute <- function(x) {
  uniqx <- unique(x)
  # Most frequent unique value (the first one wins ties)
  uniqx[which.max(tabulate(match(x, uniqx)))]
}
# Compute mode by ID
DT[ , MODE := mode_compute(CODCOM), by = ID]
Output
DT
#> ID CODCOM MODE
#> 1: 10000 12 12
#> 2: 101010 14 14
#> 3: 201020 11 11
#> 4: 201020 11 11
#> 5: 201020 12 11
#> 6: 324032 43 43
#> 7: 324032 43 43
#> 8: 324032 43 43
#> 9: 405044 51 51
#> 10: 323032 21 21
Data:
# Data
DT <- data.table(ID = c("10000", "101010", "201020", "201020", "201020",
                        "324032", "324032", "324032", "405044", "323032"),
                 CODCOM = c(12, 14, 11, 11, 12, 43, 43, 43, 51, 21))
DT
#> ID CODCOM
#> 1: 10000 12
#> 2: 101010 14
#> 3: 201020 11
#> 4: 201020 11
#> 5: 201020 12
#> 6: 324032 43
#> 7: 324032 43
#> 8: 324032 43
#> 9: 405044 51
#> 10: 323032 21
Created on 2021-10-11 by the reprex package (v0.3.0)
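As a quick sanity check, mode_compute() also works on a plain vector:
mode_compute(c(11, 11, 12))
#> [1] 11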
If I understand you correctly, we can group_by ID and then summarise with a mode function.
If you don't want to summarise, you can use mutate instead (it keeps all rows); a sketch of that variant follows the output below.
library(dplyr)
# Note: this masks base R's mode(); it returns the most frequent value,
# which works here because the codes are small positive integers
mode <- function(codes){
  which.max(tabulate(codes))
}
df %>%
  as_tibble() %>%
  group_by(ID) %>%
  summarise(NEW_COL = mode(CODCOM))
ID NEW_COL
<int> <int>
1 10000 12
2 101010 14
3 201020 11
4 323032 21
5 324032 43
6 405044 51
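And a sketch of the mutate variant mentioned above, which keeps all rows (same mode helper assumed):
df %>%
  as_tibble() %>%
  group_by(ID) %>%
  mutate(NEW_COL = mode(CODCOM)) %>%
  ungroup()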
Base R solution:
# Option 1 using TarJae's mode function:
# Apply the function groupwise, store result as vector:
# NEW_COL => integer vector
df$NEW_COL <- with(
  df,
  ave(
    CODCOM,
    ID,
    FUN = function(x){
      which.max(tabulate(x))
    }
  )
)
# Option two:
# Function to calculate the mode of a vector:
# mode_statistic => function()
mode_statistic <- function(x){
  # Calculate the mode: res => vector
  res <- names(
    head(
      sort(
        table(
          x
        ),
        decreasing = TRUE
      ),
      1
    )
  )
  # Explicitly define returned object: character vector => env
  return(res)
}
# Apply the function groupwise, store result as vector:
# NEW_COL => integer vector
df$NEW_COL <- with(
  df,
  ave(
    CODCOM,
    ID,
    FUN = function(x){
      as.integer(
        mode_statistic(x)
      )
    }
  )
)
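Note that mode_statistic() returns the name of the table entry, i.e. a character value, which is why the as.integer() coercion above is needed:
mode_statistic(c(11, 11, 12))
#> [1] "11"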
I want to calculate the mean in a row if at least three out of six observations in the row are not NA. If four or more NAs are present, the mean should show NA.
Example which gives me the mean, ignoring the NAs:
require(dplyr)
a <- 1:10
b <- a+10
c <- a+20
d <- a+30
e <- a+40
f <- a+50
df <- data.frame(a,b,c,d,e,f)
df[2,c(1,3,4,6)] <- NA
df[5,c(1,4,6)] <- NA
df[8,c(1,2,5,6)] <- NA
df <- df %>% mutate(mean = rowMeans(df[,1:6], na.rm=TRUE))
I thought about using case_when, but I'm not sure how to use it correctly:
df <- df %>% mutate(mean = case_when( ~ rowMeans(df[,1:6], na.rm=TRUE), TRUE ~ NA))
You can try a base R solution: save the number of non-NA values in a new variable and then use ifelse() for the mean:
#Data
a <- 1:10
b <- a+10
c <- a+20
d <- a+30
e <- a+40
f <- a+50
df <- data.frame(a,b,c,d,e,f)
df[2,c(1,3,4,6)] <- NA
df[5,c(1,4,6)] <- NA
df[8,c(1,2,5,6)] <- NA
#Code
# Count number of non-NA values per row
df$count <- rowSums(!is.na(df[, 1:6]))
# Compute the mean only where at least three values are present
df$Mean <- ifelse(df$count >= 3, rowMeans(df[, 1:6], na.rm = TRUE), NA)
Output:
a b c d e f count Mean
1 1 11 21 31 41 51 6 26.00000
2 NA 12 NA NA 42 NA 2 NA
3 3 13 23 33 43 53 6 28.00000
4 4 14 24 34 44 54 6 29.00000
5 NA 15 25 NA 45 NA 3 28.33333
6 6 16 26 36 46 56 6 31.00000
7 7 17 27 37 47 57 6 32.00000
8 NA NA 28 38 NA NA 2 NA
9 9 19 29 39 49 59 6 34.00000
10 10 20 30 40 50 60 6 35.00000
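If you'd rather not keep the helper count column, the same condition can be inlined (a sketch):
df$Mean <- ifelse(rowSums(!is.na(df[, 1:6])) >= 3,
                  rowMeans(df[, 1:6], na.rm = TRUE), NA)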
You could do:
library(dplyr)
df %>%
  rowwise %>%
  mutate(
    mean = case_when(
      sum(is.na(c_across())) < 4 ~ mean(c_across(), na.rm = TRUE),
      TRUE ~ NA_real_)
  ) %>%
  ungroup()
Output:
# A tibble: 10 x 7
a b c d e f mean
<int> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl>
1 1 11 21 31 41 51 26
2 NA 12 NA NA 42 NA NA
3 3 13 23 33 43 53 28
4 4 14 24 34 44 54 29
5 NA 15 25 NA 45 NA 28.3
6 6 16 26 36 46 56 31
7 7 17 27 37 47 57 32
8 NA NA 28 38 NA NA NA
9 9 19 29 39 49 59 34
10 10 20 30 40 50 60 35
This leverages rowwise and c_across, which together mean operating at the row level, so you can use vectorized functions such as sum, mean, etc. in their usual way (also with case_when).
c_across also has a cols argument where you can specify which columns to take into account. For example, if you'd like to take columns 1:6 into account, you can specify this as:
df %>%
  rowwise %>%
  mutate(
    mean = case_when(
      sum(is.na(c_across(1:6))) < 4 ~ mean(c_across(), na.rm = TRUE),
      TRUE ~ NA_real_)
  ) %>%
  ungroup()
Alternatively, if you'd e.g. like to take into account all columns except column number 2, you would do c_across(-2). You can also use column names, e.g. for the first example c_across(a:f) (all columns) or for the second c_across(-b) (all columns except b).
This is implemented internally in dplyr, but you could also do the usual vector subsetting by taking the whole c_across() (which defaults to all columns, i.e. everything()) and doing e.g. c_across()[1:6] or c_across()[-2].
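As an illustration, here is the same computation restricted to columns a:f by name (a sketch; equivalent here, since a:f are all the columns):
df %>%
  rowwise %>%
  mutate(
    mean = case_when(
      sum(is.na(c_across(a:f))) < 4 ~ mean(c_across(a:f), na.rm = TRUE),
      TRUE ~ NA_real_)
  ) %>%
  ungroup()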
We can create an index first and then do the assignment based on the index:
i1 <- rowSums(!is.na(df)) >= 3
df$Mean[i1] <- rowMeans(df[i1, ], na.rm = TRUE)  # assumes df holds only the six data columns here
df
# a b c d e f Mean
#1 1 11 21 31 41 51 26.00000
#2 NA 12 NA NA 42 NA NA
#3 3 13 23 33 43 53 28.00000
#4 4 14 24 34 44 54 29.00000
#5 NA 15 25 NA 45 NA 28.33333
#6 6 16 26 36 46 56 31.00000
#7 7 17 27 37 47 57 32.00000
#8 NA NA 28 38 NA NA NA
#9 9 19 29 39 49 59 34.00000
#10 10 20 30 40 50 60 35.00000
I have a tibble like so:
df <- tibble(a = seq(1:10),
b = seq(21,30),
c = seq(31,40))
I want to create a new tibble with lagged versions of some columns. The new columns should be named prev_ plus the original column name, e.g. prev_a.
In my actual data there are a lot of columns, so I don't want to write it out manually. Additionally, I only want to do it for some columns. In this example I have done it manually, but I wanted to know if there is a way to use a function to do it.
df_new <- df %>%
  mutate(prev_a = lag(a),
         prev_b = lag(b),
         prev_c = lag(c))
Thanks for your help!
With the current dplyr version you can create new variable names with mutate_at; using a named list takes the name of the list as a suffix. If you want it as a prefix, as in your example, you can use rename_at to correct the variable naming. With your real data you need to adjust the vars() selection; for your example data, matches("[a-c]") works.
library(dplyr)
df <- tibble(a = seq(1:10),
             b = seq(21, 30),
             c = seq(31, 40))
df %>%
  mutate_at(vars(matches("[a-c]")), list(prev = ~ lag(.x)))
#> # A tibble: 10 x 6
#> a b c a_prev b_prev c_prev
#> <int> <int> <int> <int> <int> <int>
#> 1 1 21 31 NA NA NA
#> 2 2 22 32 1 21 31
#> 3 3 23 33 2 22 32
#> 4 4 24 34 3 23 33
#> 5 5 25 35 4 24 34
#> 6 6 26 36 5 25 35
#> 7 7 27 37 6 26 36
#> 8 8 28 38 7 27 37
#> 9 9 29 39 8 28 38
#> 10 10 30 40 9 29 39
df %>%
  mutate_at(vars(matches("[a-c]")), list(prev = ~ lag(.x))) %>%
  rename_at(vars(contains("_prev")), list(~ paste("prev", gsub("_prev", "", .), sep = "_")))
#> # A tibble: 10 x 6
#> a b c prev_a prev_b prev_c
#> <int> <int> <int> <int> <int> <int>
#> 1 1 21 31 NA NA NA
#> 2 2 22 32 1 21 31
#> 3 3 23 33 2 22 32
#> 4 4 24 34 3 23 33
#> 5 5 25 35 4 24 34
#> 6 6 26 36 5 25 35
#> 7 7 27 37 6 26 36
#> 8 8 28 38 7 27 37
#> 9 9 29 39 8 28 38
#> 10 10 30 40 9 29 39
Created on 2020-04-29 by the reprex package (v0.3.0)
You could do it this way:
df_new <- bind_cols(
  df,
  df %>% mutate_at(.vars = vars("a", "b", "c"), function(x) lag(x))
)
Names are a bit nasty, but you can rename them. Or see #Bas's comment to get the names with a suffix.
# A tibble: 10 x 6
a b c a1 b1 c1
<int> <int> <int> <int> <int> <int>
1 1 21 31 NA NA NA
2 2 22 32 1 21 31
3 3 23 33 2 22 32
4 4 24 34 3 23 33
5 5 25 35 4 24 34
6 6 26 36 5 25 35
7 7 27 37 6 26 36
8 8 28 38 7 27 37
9 9 29 39 8 28 38
10 10 30 40 9 29 39
If you have dplyr 1.0 you can use the new across() function.
See some examples from the docs; instead of mean you want lag:
df %>% mutate_if(is.numeric, mean, na.rm = TRUE)
# ->
df %>% mutate(across(is.numeric, mean, na.rm = TRUE))
df %>% mutate_at(vars(x, starts_with("y")), mean, na.rm = TRUE)
# ->
df %>% mutate(across(c(x, starts_with("y")), mean, na.rm = TRUE))
df %>% mutate_all(mean, na.rm = TRUE)
# ->
df %>% mutate(across(everything(), mean, na.rm = TRUE))
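Applied to this question, a sketch using across() with its .names argument (assuming dplyr >= 1.0) produces the prev_ prefix in one step:
df %>%
  mutate(across(a:c, lag, .names = "prev_{.col}"))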
Here is a test table:
df <- read.table(text="
str1 str2 name t y x
a yes bas 23 323 21
b no aasd 23 54 33
a no asd 2 43 23
b yes hggf 43 123 55
b no jgd 1 12 11
b yes qw 32 12 12
a yes rrrr 45 22 32
a no ggg 121 11 43
",
header = TRUE)
With help from an earlier answer we can get such subtotals:
library(janitor)
library(purrr)
library(dplyr)
df <- df %>%
  split(.[, "str1"]) %>%  # splits each group of str1 into a list of data frames
  map_df(., janitor::adorn_totals)
But my question is how to also get subtotals inside each group of column str1, depending on the groups inside str2. A data frame like this is needed:
Would appreciate any help.
P.S. It is vital that the x column be in descending order in each group.
We can do the split by two columns and then change the name of the 'Total' based on the values in 'str1', 'str2'
library(dplyr)
library(janitor)
library(purrr)
library(stringr)
df %>%
  group_split(str1, str2) %>%
  map_dfr(~ .x %>%
            janitor::adorn_totals(.) %>%
            mutate(str1 = replace(str1, n(), str_c(str1[n()], "_",
                                                   first(str1), "_", first(str2)))))
Alternatively, using the same syntax as for your first split, you can do:
library(janitor)
library(purrr)
library(dplyr)
df %>%
  arrange(x) %>%
  split(.[, c("str2", "str1")]) %>%
  map_df(., janitor::adorn_totals)
str1 str2 name t y x
a no asd 2 43 23
a no ggg 121 11 43
Total - - 123 54 66
a yes bas 23 323 21
a yes rrrr 45 22 32
Total - - 68 345 53
b no jgd 1 12 11
b no aasd 23 54 33
Total - - 24 66 44
b yes qw 32 12 12
b yes hggf 43 123 55
Total - - 75 135 67
If you don't mind the location of the "total" rows being a little different, you can use data.table::rollup. Rows with NA are totals for the group identified by the values of the non-NA columns.
library(data.table)
setDT(df)
group_vars <- head(names(df), 3)
df_ru <-
  rollup(df, j = lapply(.SD, sum), by = group_vars,
         .SDcols = tail(names(df), 3))
setorderv(df_ru, group_vars)[-1]  # [-1] drops the grand-total row (all key columns NA)
#> str1 str2 name t y x
#> 1: a <NA> <NA> 191 399 119
#> 2: a no <NA> 123 54 66
#> 3: a no asd 2 43 23
#> 4: a no ggg 121 11 43
#> 5: a yes <NA> 68 345 53
#> 6: a yes bas 23 323 21
#> 7: a yes rrrr 45 22 32
#> 8: b <NA> <NA> 99 201 111
#> 9: b no <NA> 24 66 44
#> 10: b no aasd 23 54 33
#> 11: b no jgd 1 12 11
#> 12: b yes <NA> 75 135 67
#> 13: b yes hggf 43 123 55
#> 14: b yes qw 32 12 12
Created on 2021-06-05 by the reprex package (v2.0.0)
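Optionally, a sketch for relabelling the NA subtotal markers (assuming the key columns are character, as with stringsAsFactors = FALSE):
res <- setorderv(df_ru, group_vars)[-1]
res[is.na(str2), str2 := "Total"]
res[is.na(name), name := "Total"]
res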
I have the following dataframe in R.
df <- data.frame(
  "DateValue" = c("2016-07-01", "2016-07-02", "2016-07-03", "2016-07-04", "2016-07-05", "2016-07-06",
                  "2017-07-01", "2017-07-02", "2017-07-03", "2017-07-04", "2017-07-05", "2017-07-06",
                  "2018-07-01", "2018-07-02", "2018-07-03", "2018-07-04", "2018-07-05", "2018-07-06"),
  "Age1" = 1:18,
  "Age2" = c(seq(14, 36, 2), rep(NA, 6)),
  "Age3" = c(seq(45, 50), rep(NA, 12))
)
DateValue Age1 Age2 Age3
# 1 2016-07-01 1 14 45
# 2 2016-07-02 2 16 46
# 3 2016-07-03 3 18 47
# 4 2016-07-04 4 20 48
# 5 2016-07-05 5 22 49
# 6 2016-07-06 6 24 50
# 7 2017-07-01 7 26 NA
# 8 2017-07-02 8 28 NA
# 9 2017-07-03 9 30 NA
# 10 2017-07-04 10 32 NA
# 11 2017-07-05 11 34 NA
# 12 2017-07-06 12 36 NA
# 13 2018-07-01 13 NA NA
# 14 2018-07-02 14 NA NA
# 15 2018-07-03 15 NA NA
# 16 2018-07-04 16 NA NA
# 17 2018-07-05 17 NA NA
# 18 2018-07-06 18 NA NA
I am trying to come up with code that aligns the data from the "Age2" and "Age3" columns so that the dates line up. Below is the output I am looking for:
df <- data.frame(
  "DateValue" = c("07-01", "07-02", "07-03", "07-04", "07-05", "07-06"),
  "Age1" = 13:18,
  "Age2" = seq(26, 36, 2),
  "Age3" = seq(45, 50)
)
# DateValue Age1 Age2 Age3
# 1 07-01 13 26 45
# 2 07-02 14 28 46
# 3 07-03 15 30 47
# 4 07-04 16 32 48
# 5 07-05 17 34 49
# 6 07-06 18 36 50
I am essentially keeping all the dates and values for my current year (2018) and matching them with the dates from the previous years. Note that I may have more dates in a previous year, but I need to drop all the rows that do not have any data for the current year. I reviewed the following SO thread on rearranging a data frame, but the context is quite different from my situation:
R Data Rearrange
I tried the R reshape package but haven't had any luck. Any suggestions/pointers would be appreciated.
Here's a non-robust solution:
# Keep only the month-day part so different years line up
df$DateValue <- format(as.Date(df$DateValue), '%m-%d')
# The column with the fewest non-NA values (Age3) sets the result length
Age3_non_NA <- sum(!is.na(df[['Age3']]))
# Keep the trailing non-NA values of every column
df <- as.data.frame(lapply(df, function(l) tail(na.omit(l), Age3_non_NA)))
df
DateValue Age1 Age2 Age3
1 07-01 13 26 45
2 07-02 14 28 46
3 07-03 15 30 47
4 07-04 16 32 48
5 07-05 17 34 49
6 07-06 18 36 50
And here's a more robust solution that includes gather and spread:
library(tidyr)
library(dplyr)
library(lubridate)
df %>%
  mutate(DateValue = as.Date(DateValue),
         Year = year(DateValue),
         Mon_Day = format(DateValue, '%m-%d')) %>%
  select(-DateValue) %>%
  gather(Age, val, -Year, -Mon_Day, na.rm = TRUE) %>%
  group_by(Age, Mon_Day) %>%
  filter(Year == max(Year)) %>%
  ungroup() %>%
  select(-Year) %>%
  spread(Age, val)
# A tibble: 6 x 4
Mon_Day Age1 Age2 Age3
<chr> <dbl> <dbl> <dbl>
1 07-01 13 26 45
2 07-02 14 28 46
3 07-03 15 30 47
4 07-04 16 32 48
5 07-05 17 34 49
6 07-06 18 36 50
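Since gather() and spread() have been superseded, a rough equivalent with the newer tidyr verbs (a sketch, assuming tidyr >= 1.0) would be:
df %>%
  mutate(DateValue = as.Date(DateValue),
         Year = year(DateValue),
         Mon_Day = format(DateValue, '%m-%d')) %>%
  select(-DateValue) %>%
  pivot_longer(starts_with("Age"), names_to = "Age", values_to = "val",
               values_drop_na = TRUE) %>%
  group_by(Age, Mon_Day) %>%
  filter(Year == max(Year)) %>%
  ungroup() %>%
  select(-Year) %>%
  pivot_wider(names_from = Age, values_from = val)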
Here's one way to do it. This could definitely be refactored, but it works.
library(dplyr)
# DateValue is a factor; convert to date format
df$DateValue <- as.Date(as.character(df$DateValue), format="%Y-%m-%d")
# grab the month and day from DateValue, sort by Age1
df <- df %>%
  mutate(MonthAndDay = format(DateValue, "%m-%d")) %>%
  arrange(desc(Age1))
# get vector of dates
dates <- df$MonthAndDay[which(!duplicated(df$MonthAndDay))]
# define age columns
agecols <- c("Age1","Age2","Age3")
# initialize empty df to be populated in loop
temp <- data.frame(MonthAndDay = dates)
# For each column, get values that a) are in the target dates, b) aren't NA,
# and c) are the first occurrences (not duplicates -- that's why we arranged
# by Age1 before). Select the values and add them as a new column.
for (col in agecols) {
  temp_col <- filter(df, MonthAndDay %in% dates & !is.na(df[, col]))
  temp_col <- filter(temp_col[-which(duplicated(df$MonthAndDay)), ]) %>%
    select(col)
  temp[, col] <- temp_col
}
temp %>% arrange(MonthAndDay)
# MonthAndDay Age1 Age2 Age3
# 1 07-01 13 26 45
# 2 07-02 14 28 46
# 3 07-03 15 30 47
# 4 07-04 16 32 48
# 5 07-05 17 34 49
# 6 07-06 18 36 50
Using base R, here is one way
#Get age columns
age_cols <- grep("^Age", names(df))
#Convert date to actual object
df$DateValue <- as.Date(df$DateValue)
#Get year from date
df$year <- as.integer(format(df$DateValue, "%Y"))
#Get month-date from Date
df$month_date <- format(df$DateValue, "%m-%d")
#Select dates which are present in max year
subset_date <- with(df, month_date[year == max(year)])
#For each age_cols select the non NA values which match subset_date
cbind.data.frame(DateValue = subset_date,
                 sapply(df[age_cols], function(x) {
                   x <- x[order(df$year, decreasing = TRUE)]
                   x <- x[!is.na(x)]
                   x[match(subset_date, df$month_date)]
                 }))
# DateValue Age1 Age2 Age3
#1 07-01 13 26 45
#2 07-02 14 28 46
#3 07-03 15 30 47
#4 07-04 16 32 48
#5 07-05 17 34 49
#6 07-06 18 36 50
I have several tables (two in this particular example) with the same structure. I would like to join on ID_Position and ID_Name and generate the sum of January and February in the output table (there might be some NAs in both columns).
ID_Position<-c(1,2,3,4,5,6,7,8,9,10)
Position<-c("A","B","C","D","E","H","I","J","X","W")
ID_Name<-c(11,12,13,14,15,16,17,18,19,20)
Name<-c("Michael","Tobi","Chris","Hans","Likas","Martin","Seba","Li","Sha","Susi")
jan<-c(10,20,30,22,23,2,22,24,26,28)
feb<-c(10,30,20,12,NA,3,NA,22,24,26)
df1 <- data.frame(ID_Position,Position,ID_Name,Name,jan,feb)
ID_Position<-c(1,2,3,4,5,6,7,8,9,10)
Position<-c("A","B","C","D","E","H","I","J","X","W")
ID_Name<-c(11,12,13,14,15,16,17,18,19,20)
Name<-c("Michael","Tobi","Chris","Hans","Likas","Martin","Seba","Li","Sha","Susi")
jan<-c(10,20,30,22,NA,NA,22,24,26,28)
feb<-c(10,30,20,12,23,3,3,22,24,26)
df2 <- data.frame(ID_Position,Position,ID_Name,Name,jan,feb)
I tried the inner and the full join, but neither seems to work as I desire:
library(plyr)
test <- join(df1, df2, by = c("ID_Position", "ID_Name"), type = "inner", match = "all")
Desired output:
ID_Position Position ID_Name Name jan feb
1 A 11 Michael 20 20
2 B 12 Tobi 40 60
3 C 13 Chris 60 40
4 D 14 Hans 44 24
5 E 15 Likas 23 23
6 H 16 Martin 2 6
7 I 17 Seba 44 22
8 J 18 Li 48 44
9 X 19 Sha 52 48
10 W 20 Susi 56 52
Your desired output doesn't seem entirely correct, but here's an example of how you can do this efficiently using a data.table binary join, which lets you run functions while joining via the by = .EACHI option:
library(data.table)
setkey(setDT(df1), ID_Position, ID_Name, Name)
setkey(setDT(df2), ID_Position, ID_Name, Name)
df2[df1, .(jan = sum(jan, i.jan, na.rm = TRUE),
           feb = sum(feb, i.feb, na.rm = TRUE)),
    by = .EACHI]
# ID_Position ID_Name Name jan feb
# 1: 1 11 Michael 20 20
# 2: 2 12 Tobi 40 60
# 3: 3 13 Chris 60 40
# 4: 4 14 Hans 44 24
# 5: 5 15 Likas 46 0
# 6: 6 16 Martin 0 6
# 7: 7 17 Seba 44 0
# 8: 8 18 Li 48 44
# 9: 9 19 Sha 52 48
# 10: 10 20 Susi 56 52
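For comparison, a dplyr sketch of the same stack-and-sum idea (assuming dplyr >= 1.0; rows where one table has NA follow na.rm = TRUE, so they may differ from the join output above):
library(dplyr)
bind_rows(df1, df2) %>%
  group_by(ID_Position, Position, ID_Name, Name) %>%
  summarise(jan = sum(jan, na.rm = TRUE),
            feb = sum(feb, na.rm = TRUE),
            .groups = "drop")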