I would like to sum the values of Var1 and Var2 for each row and produce a new column titled Vars which gives the total of Var1 and Var2. I would then like to do the same for Col1 and Col2 and have a new column titled Cols which gives the sum of Col1 and Col2. How do I write the code for this? Thanks in advance.
df
ID Var1 Var2 Col1 Col2
1 34 22 34 24
2 3 25 54 65
3 87 68 14 78
4 66 98 98 100
5 55 13 77 2
Expected outcome would be the following:
df
ID Var1 Var2 Col1 Col2 Vars Cols
1 34 22 34 24 56 58
2 3 25 54 65 28 119
3 87 68 14 78 155 92
4 66 98 98 100 164 198
5 55 13 77 2 68 79
Assuming that column ID is irrelevant (no groups) and you are happy to specify column names (solution hard-coded, not generic).
A base R solution:
df$Vars <- rowSums(df[, c("Var1", "Var2")])
df$Cols <- rowSums(df[, c("Col1", "Col2")])
A tidyverse solution:
library(dplyr)
library(purrr)
df %>% mutate(Vars = map2_int(Var1, Var2, sum),
              Cols = map2_int(Col1, Col2, sum))
# or just
df %>% mutate(Vars = Var1 + Var2,
              Cols = Col1 + Col2)
There are many different ways to do this. With
library(dplyr)
df <- df %>%            # input dataframe
  group_by(ID) %>%      # do it for every ID, so every row
  mutate(               # add columns to the data frame
    Vars = Var1 + Var2, # do the calculation
    Cols = Col1 + Col2
  )
But there are many other ways, e.g. with apply functions. I suggest reading about the tidyverse.
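As a sketch of the apply-family approach mentioned above, the same sums can be computed row-wise with apply() over the selected columns:

```r
# Example data matching the question
df <- data.frame(ID = 1:5,
                 Var1 = c(34, 3, 87, 66, 55), Var2 = c(22, 25, 68, 98, 13),
                 Col1 = c(34, 54, 14, 98, 77), Col2 = c(24, 65, 78, 100, 2))

# MARGIN = 1 applies sum() to each row of the selected columns
df$Vars <- apply(df[c("Var1", "Var2")], 1, sum)
df$Cols <- apply(df[c("Col1", "Col2")], 1, sum)
```

In practice rowSums() is faster for plain sums; apply() is worth knowing for arbitrary row-wise functions.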
Another dplyr way is to use the helper function starts_with() to select the columns and then rowSums() to sum them.
library(dplyr)
df$Vars <- df %>% select(starts_with("Var")) %>% rowSums()
df$Cols <- df %>% select(starts_with("Col")) %>% rowSums()
df
# ID Var1 Var2 Col1 Col2 Vars Cols
#1 1 34 22 34 24 56 58
#2 2 3 25 54 65 28 119
#3 3 87 68 14 78 155 92
#4 4 66 98 98 100 164 198
#5 5 55 13 77 2 68 79
A base solution that sums every set of columns which share the same name prefix and end with digits, using gsub:
tt <- paste0(gsub('[[:digit:]]+', '', names(df)[-1]), "s")
df <- cbind(df, sapply(unique(tt), function(x) rowSums(df[grep(x, tt) + 1])))
df
# ID Var1 Var2 Col1 Col2 Vars Cols
#1 1 34 22 34 24 56 58
#2 2 3 25 54 65 28 119
#3 3 87 68 14 78 155 92
#4 4 66 98 98 100 164 198
#5 5 55 13 77 2 68 79
Or an even more general solution:
idx <- grep('[[:digit:]]', names(df))
tt <- paste0(gsub('[[:digit:]]+', '', names(df)[idx]),"s")
df <- cbind(df, sapply(unique(tt), function(x) {rowSums(df[idx[grep(x, tt)]])}))
I would like to replace some column values in a df based on a column in another data frame.
This is the head of the first df:
df1
A tibble: 253 x 2
id sum_correct
<int> <dbl>
1 866093 77
2 866097 95
3 866101 37
4 866102 65
5 866103 16
6 866104 72
7 866105 99
8 866106 90
9 866108 74
10 866109 92
Some sum_correct values need to be replaced by the correct values from another df, using id to trigger the replacement:
df2
A tibble: 14 x 2
id sum_correct
<int> <dbl>
1 866103 61
2 866124 79
3 866152 85
4 867101 24
5 867140 76
6 867146 51
7 867152 56
8 867200 50
9 867209 97
10 879657 56
11 879680 61
12 879683 58
13 879693 77
14 881451 57
How can I achieve this in RStudio? Thanks for the help in advance.
You can make an update join: use match to find where id matches, and remove non-matches (NA) with which:
idx <- match(df1$id, df2$id)
idxn <- which(!is.na(idx))
df1$sum_correct[idxn] <- df2$sum_correct[idx[idxn]]
df1
id sum_correct
1 866093 77
2 866097 95
3 866101 37
4 866102 65
5 866103 61
6 866104 72
7 866105 99
8 866106 90
9 866108 74
10 866109 92
You can do a left_join and then use coalesce:
library(dplyr)
left_join(df1, df2, by = "id", suffix = c("_1", "_2")) %>%
  mutate(sum_correct_final = coalesce(sum_correct_2, sum_correct_1))
The new column sum_correct_final contains the value from df2 if it exists and from df1 if a corresponding entry from df2 does not exist.
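Since the goal is an update join, data.table also expresses this directly, modifying df1 in place. A sketch on hypothetical minimal versions of df1 and df2; the i. prefix refers to columns from the joined table (df2):

```r
library(data.table)

# Hypothetical minimal versions of df1 and df2 from the question
df1 <- data.frame(id = c(866093L, 866097L, 866103L), sum_correct = c(77, 95, 16))
df2 <- data.frame(id = 866103L, sum_correct = 61)

# Update join: rows of df1 matching df2 on id get df2's sum_correct
setDT(df1)[setDT(df2), on = "id", sum_correct := i.sum_correct]
df1
```

Rows of df1 with no match in df2 keep their original sum_correct.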
Here is a test table:
df <- read.table(text="
str1 str2 name t y x
a yes bas 23 323 21
b no aasd 23 54 33
a no asd 2 43 23
b yes hggf 43 123 55
b no jgd 1 12 11
b yes qw 32 12 12
a yes rrrr 45 22 32
a no ggg 121 11 43
",
header = TRUE)
With help from here, we can get subtotals like this:
library(janitor)
library(purrr)
library(dplyr)
df <- df %>%
  split(.[, "str1"]) %>%  # splits each change in str1 into a list of dataframes
  map_df(., janitor::adorn_totals)
But my question is how to also get subtotals inside each group of column str1, depending on the group inside str2. I need a dataframe like this:
Would appreciate any help.
P.S. It is vital that the x column be in descending order within each group.
We can split by the two columns and then change the name of the 'Total' row based on the values in 'str1' and 'str2':
library(dplyr)
library(janitor)
library(purrr)
library(stringr)
df %>%
  group_split(str1, str2) %>%
  map_dfr(~ .x %>%
            janitor::adorn_totals() %>%
            mutate(str1 = replace(str1, n(), str_c(str1[n()], "_",
                                                   first(str1), "_", first(str2)))))
Alternatively, using the same syntax as for your first split, you can do:
library(janitor)
library(purrr)
library(dplyr)
df %>%
  arrange(x) %>%
  split(.[, c("str2", "str1")]) %>%
  map_df(., janitor::adorn_totals)
str1 str2 name t y x
a no asd 2 43 23
a no ggg 121 11 43
Total - - 123 54 66
a yes bas 23 323 21
a yes rrrr 45 22 32
Total - - 68 345 53
b no jgd 1 12 11
b no aasd 23 54 33
Total - - 24 66 44
b yes qw 32 12 12
b yes hggf 43 123 55
Total - - 75 135 67
If you don't mind the location of the "total" rows being a little different, you can use data.table::rollup. Rows with NA are totals for the group identified by the values of the non-NA columns.
library(data.table)
setDT(df)
group_vars <- head(names(df), 3)
df_ru <-
rollup(df, j = lapply(.SD, sum), by = group_vars,
.SDcols = tail(names(df), 3))
setorderv(df_ru, group_vars)[-1]
#> str1 str2 name t y x
#> 1: a <NA> <NA> 191 399 119
#> 2: a no <NA> 123 54 66
#> 3: a no asd 2 43 23
#> 4: a no ggg 121 11 43
#> 5: a yes <NA> 68 345 53
#> 6: a yes bas 23 323 21
#> 7: a yes rrrr 45 22 32
#> 8: b <NA> <NA> 99 201 111
#> 9: b no <NA> 24 66 44
#> 10: b no aasd 23 54 33
#> 11: b no jgd 1 12 11
#> 12: b yes <NA> 75 135 67
#> 13: b yes hggf 43 123 55
#> 14: b yes qw 32 12 12
Created on 2021-06-05 by the reprex package (v2.0.0)
This is my dataframe:
set.seed(1)
df <- data.frame(A = 1:50, B = 11:60, c = 21:70)
head(df)
df.final <- as.data.frame(lapply(df, function(cc)
  cc[sample(c(TRUE, NA), prob = c(0.85, 0.15), size = length(cc), replace = TRUE)]))
I want to delete the columns whose last 5 values are NA. That is, only the columns that have values in rows 46 to 50 should remain; any column with one or more NAs among its last 5 values will be deleted.
Is it possible do this with dplyr?
Any help?
dplyr::select() accepts integer column positions. We can use that to achieve this:
result <- df.final %>% select(., which(!is.na(colSums(tail(., 5)))))
head(result)
A B
1 1 11
2 2 NA
3 3 13
4 NA 14
5 5 15
6 NA 16
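Without dplyr, the same column filter can be written in base R by counting NAs in the trailing rows. A sketch using the question's setup (note the surviving columns depend on the random draw and your R version's sampling algorithm):

```r
set.seed(1)
df <- data.frame(A = 1:50, B = 11:60, c = 21:70)
df.final <- as.data.frame(lapply(df, function(cc)
  cc[sample(c(TRUE, NA), prob = c(0.85, 0.15), size = length(cc), replace = TRUE)]))

# Keep only the columns with zero NAs in the last 5 rows
result <- df.final[, colSums(is.na(tail(df.final, 5))) == 0, drop = FALSE]
```

drop = FALSE keeps the result a data frame even if only one column survives.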
Shree beat me to it, but it might come in handy:
> df.final %>% tail
A B c
45 45 55 65
46 46 NA 66
47 47 57 67
48 NA 58 68
49 NA 59 69
50 NA 60 NA
> df.final %>%
+ select_if(~ !any(is.na(tail(., n = 1)))) %>%
+ tail()
B
45 55
46 NA
47 57
48 58
49 59
50 60
Just change n above to the number of trailing rows you want to check (5, in your case).
I have a dataset with 8 variables. When I run dplyr with the syntax below, my output dataframe only has the variables I used in the dplyr code, while I want all variables.
ShowID <- MyData %>%
  group_by(id) %>%
  summarize(count = n()) %>%
  filter(count == min(count))
ShowID
So my output will have two variables - ID and Count. How do I get rest of my variables in the new dataframe? Why is this happening, what am I clueless about here?
> ncol(ShowID)
[1] 2
> ncol(MyData)
[1] 8
MYDATA
key ID v1 v2 v3 v4 v5 v6
0-0-70cf97 1 89 20 30 45 55 65
3ad4893b8c 1 4 5 45 45 55 65
0-0-70cf97d7 2 848 20 52 66 56 56
0-0-70cf 2 54 4 846 65 5 5
0-0-793b8c 3 56454 28 6 4 5 65
0-0-70cf98 2 8 4654 30 65 6 21
3ad4893b8c 2 89 66 518 156 16 65
0-0-70cf97d8 3 89 20 161 1 55 45465
0-0-70cf 5 89 79 48 45 55 456
0-0-793b8c 5 89 20 48 545 654 4
0-0-70cf99 6 9 20 30 45 55 65
DESIRED
key ID count v1 v2 v3 v4 v5 v6
0-0-70cf99 6 1 9 20 30 45 55 65
RESULT FROM CODE
ID count
6 1
You can use base R's ave to calculate the number of rows in each group (ID) and then select the groups that have the minimum number of rows.
num_rows <- ave(MyData$v1, MyData$ID, FUN = length)
MyData[which(num_rows == min(num_rows)), ]
# key ID v1 v2 v3 v4 v5 v6
#11 0-0-70cf99 6 9 20 30 45 55 65
You could also use which.min in this case to avoid one step; however, it would fail with multiple minimum values, hence I have used which.
No need to summarize:
ShowID <- MyData %>%
group_by(id) %>%
mutate(count = n()) %>%
ungroup() %>%
filter(count == min(count))
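The same mutate-then-filter pattern can be shortened with dplyr::add_count(), which attaches the group size without an explicit group_by. A sketch on a hypothetical minimal version of MyData:

```r
library(dplyr)

# Hypothetical minimal version of MyData: id 1 appears twice, ids 2 and 3 once
MyData <- data.frame(key = c("a", "b", "c", "d"),
                     id = c(1, 1, 2, 3))

ShowID <- MyData %>%
  add_count(id, name = "count") %>%   # adds a "count" column with the group size
  filter(count == min(count))         # keep rows from the smallest group(s)
```

All original columns are preserved because no summarize() collapses the rows.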
I normally find an answer in previous questions posted here, but I can't seem to find this one, so here is my maiden question:
I have a dataframe in which one column has repetitive values. I would like to split the other columns so that the first column holds only one row per value, giving more columns than in the original dataframe.
Example:
df <- data.frame(test = c(rep(1:5,3)), time = sample(1:100,15), score = sample(1:500,15))
The original dataframe has 3 columns and 15 rows.
And it would turn into a dataframe with 5 rows, with the columns split into 7 columns: 'test', 'time1', 'time2', 'time3', 'score1', 'score2', 'score3'.
Does anyone have an idea how this could be done?
I think using dcast with rowid from the data.table-package is well suited for this task:
library(data.table)
dcast(setDT(df), test ~ rowid(test), value.var = c('time','score'), sep = '')
The result:
test time1 time2 time3 score1 score2 score3
1: 1 52 3 29 21 131 45
2: 2 79 44 6 119 1 186
3: 3 67 95 39 18 459 121
4: 4 83 50 40 493 466 497
5: 5 46 14 4 465 9 24
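A tidyr equivalent of the dcast() call uses pivot_wider() after numbering the rows within each test group. A sketch, assuming tidyr >= 1.0.0 (where pivot_wider() was introduced); the seed is only there for reproducibility:

```r
library(dplyr)
library(tidyr)

set.seed(42)  # hypothetical seed, for reproducibility only
df <- data.frame(test = rep(1:5, 3), time = sample(1:100, 15), score = sample(1:500, 15))

wide <- df %>%
  group_by(test) %>%
  mutate(rep = row_number()) %>%  # 1, 2, 3 within each test group
  ungroup() %>%
  pivot_wider(names_from = rep, values_from = c(time, score), names_sep = "")
```

names_sep = "" produces the column names time1, time2, time3, score1, score2, score3 without a separator, matching the desired output.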
Please try this:
df <- data.frame(test = c(rep(1:5,3)), time = sample(1:100,15), score = sample(1:500,15))
df$class <- c(rep('a', 5), rep('b', 5), rep('c', 5))
df <- split(x = df, f = df$class)
binded <- cbind(df[[1]], df[[2]], df[[3]])
binded <- binded[,-c(5,9)]
> binded
test time score class time.1 score.1 class.1 time.2 score.2 class.2
1 1 40 404 a 57 409 b 70 32 c
2 2 5 119 a 32 336 b 93 177 c
3 3 20 345 a 44 91 b 100 42 c
4 4 47 468 a 60 265 b 24 478 c
5 5 16 52 a 38 219 b 3 92 c
Let me know if it works for you!