My data frame looks like this:
View(df)
Product Value
a 2
b 4
c 3
d 10
e 15
f 5
g 6
h 4
i 50
j 20
k 35
l 25
m 4
n 6
o 30
p 4
q 40
r 5
s 3
t 40
I want to find the 9 most expensive products and summarise the rest. It should look like this:
Product Value
d 10
e 15
i 50
j 20
k 35
l 25
o 30
q 40
t 40
rest 46
Rest is the sum of the other 11 products.
I tried it with summarise, but it didn't work:
new <- df %>%
group_by(Product)%>%
summarise((Value > 10) = sum(Value)) %>%
ungroup()
We can use dplyr::row_number to rank the observations after using arrange to order the data by Value in descending order. Then we recode the Product column so that any products outside the top 9 are labelled Rest. Finally, we group by the updated Product and sum Value with summarise. Note that the result is sorted by Product, so Rest is not necessarily the last row.
dat %>%
arrange(desc(Value)) %>%
mutate(RowNum = row_number(),
Product = ifelse(RowNum <= 9, Product, 'Rest')) %>%
group_by(Product) %>%
summarise(Value = sum(Value))
# A tibble: 10 × 2
Product Value
<chr> <int>
1 d 10
2 e 15
3 i 50
4 j 20
5 k 35
6 l 25
7 o 30
8 q 40
9 Rest 46
10 t 40
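If the rows should keep their original order with the summary row last, as in the question's desired output, the same idea can be sketched in base R. This is only a sketch under the assumption that the columns are named Product and Value as above; top_idx and result are illustrative names.

```r
# Same values as the question's data frame
dat <- data.frame(
  Product = letters[1:20],
  Value = c(2, 4, 3, 10, 15, 5, 6, 4, 50, 20, 35, 25, 4, 6, 30, 4, 40, 5, 3, 40),
  stringsAsFactors = FALSE
)

# Row indices of the 9 largest values, put back into original row order
top_idx <- sort(order(dat$Value, decreasing = TRUE)[1:9])

# Keep the top 9 rows as they are and append one row summing everything else
result <- rbind(
  dat[top_idx, ],
  data.frame(Product = "rest", Value = sum(dat$Value[-top_idx]),
             stringsAsFactors = FALSE)
)
rownames(result) <- NULL
result
```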
data
dat <- structure(list(Product = c("a", "b", "c", "d", "e", "f", "g",
"h", "i", "j", "k", "l", "m", "n", "o", "p", "q", "r", "s", "t"
), Value = c(2L, 4L, 3L, 10L, 15L, 5L, 6L, 4L, 50L, 20L, 35L,
25L, 4L, 6L, 30L, 4L, 40L, 5L, 3L, 40L)), .Names = c("Product",
"Value"), class = "data.frame", row.names = c(NA, -20L))
Another way with dplyr is to build the result with do. The code is a bit hard to read because you need .$, but it avoids ifelse/if_else. After arranging by Value in descending order, create two vectors: one with the first nine product names plus "Rest", the other with the first nine values plus the sum of the remaining values. do then builds the data frame directly from them.
df %>%
arrange(desc(Value)) %>%
do(data.frame(Product = c(as.character(.$Product[1:9]), "Rest"),
Value = c(.$Value[1:9], sum(.$Value[10:length(.$Value)]))))
# Product Value
#1 i 50
#2 q 40
#3 t 40
#4 k 35
#5 o 30
#6 l 25
#7 j 20
#8 e 15
#9 d 10
#10 Rest 46
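One caveat with indexing such as 10:length(x) is that it counts backwards when there are fewer than 10 rows. A hedged base R sketch that wraps the same "top n plus rest" logic in a function and handles that case; top_with_rest is a hypothetical helper name, not part of any package, and it assumes columns named Product and Value.

```r
# Hypothetical helper: keep the n largest rows by Value, sum the remainder
top_with_rest <- function(data, n = 9) {
  sorted <- data[order(data$Value, decreasing = TRUE), ]
  top <- head(sorted, n)
  rest <- tail(sorted, -n)  # all rows after the first n; empty if nrow(data) <= n
  if (nrow(rest) > 0) {
    top <- rbind(top, data.frame(Product = "Rest", Value = sum(rest$Value),
                                 stringsAsFactors = FALSE))
  }
  rownames(top) <- NULL
  top
}

df <- data.frame(
  Product = letters[1:20],
  Value = c(2, 4, 3, 10, 15, 5, 6, 4, 50, 20, 35, 25, 4, 6, 30, 4, 40, 5, 3, 40),
  stringsAsFactors = FALSE
)

result <- top_with_rest(df)          # 9 products plus a "Rest" row
small  <- top_with_rest(head(df, 3)) # fewer than n rows: no "Rest" row is added
```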
Here is one option using data.table
library(data.table)
setDT(df)[, i1 := .I][order(-Value)
][-(seq_len(9)), Product := 'rest'
][, .(Value = sum(Value), i1=i1[1L]), Product
][order(Product=='rest', i1)][, i1 := NULL][]
# Product Value
#1: d 10
#2: e 15
#3: i 50
#4: j 20
#5: k 35
#6: l 25
#7: o 30
#8: q 40
#9: t 40
#10: rest 46
Related
I have data that I separated within a dataframe by item description and type. The separations are blank rows but I would like to fill the blank rows in with the sums of numeric values by each description and, if possible, add another blank row below the sums. Preferably, I would not need to sum the sections of data that only contain one row - see variable desc "a" but not a big deal if I do get a sum there.
This is an example of what I have now:
desc type xvalue yvalue
1 a z 16 1
2
3 b y 17 2
4 b y 18 3
5
6 c x 19 4
7 c x 20 5
8 c x 21 6
9
10 d x 22 7
11 d x 23 8
12
13 d y 24 9
14 d y 25 10
What I am looking for is output that looks similar to this.
desc type xvalue yvalue
1 a z 16 1
2
3 b y 17 2
4 b y 18 3
5 35 5
6
7 c x 19 4
8 c x 20 5
9 c x 21 6
10 60 15
11
12 d x 22 7
13 d x 23 8
14 45 15
15
16 d y 24 9
17 d y 25 10
18 49 19
I found an answer on how to do this for a column but not for a row: "Adding column of summed data by group with empty rows with R".
To create the empty rows, I used acylam's dplyr answer to the question "Add blank rows in between existing rows". I changed the code slightly to fit my data better, so my code is:
library(dplyr)
df %>%
split(df$id, df$group) %>%
Map(rbind, ., "") %>%
do.call(rbind, .)
I am hoping I can just add options to the do.call(rbind...) dplyr code I have above.
Depending on how your data is organized, we could do it this way.
This assumes the empty rows are NA (if they are blank strings instead, convert them to NA first).
After grouping, group_split() returns a list of data frames; we then iterate over that list with map_df, applying janitor's adorn_totals to each piece.
library(dplyr)
library(janitor)
df %>%
na.omit() %>% # maybe you don't need this line
group_by(desc, type) %>%
group_split() %>%
purrr::map_df(., janitor::adorn_totals)
desc type xvalue yvalue
a z 16 1
Total - 16 1
b y 17 2
b y 18 3
Total - 35 5
c x 19 4
c x 20 5
c x 21 6
Total - 60 15
d x 22 7
d x 23 8
Total - 45 15
d y 24 9
d y 25 10
Total - 49 19
data:
structure(list(desc = c("a", NA, "b", "b", NA, "c", "c", "c",
NA, "d", "d", NA, "d", "d"), type = c("z", NA, "y", "y", NA,
"x", "x", "x", NA, "x", "x", NA, "y", "y"), xvalue = c(16L, NA,
17L, 18L, NA, 19L, 20L, 21L, NA, 22L, 23L, NA, 24L, 25L), yvalue = c(1L,
NA, 2L, 3L, NA, 4L, 5L, 6L, NA, 7L, 8L, NA, 9L, 10L)), class = "data.frame", row.names = c("1",
"2", "3", "4", "5", "6", "7", "8", "9", "10", "11", "12", "13",
"14"))
Here's a full answer that adds the empty rows and removes the labels janitor adds in @TarJae's answer:
library(dplyr)
library(janitor)
df <- df %>%
na.omit() %>% # maybe you don't need this line
group_by(desc, type) %>%
group_split() %>%
purrr::map_df(., \(x) {x <- x %>% janitor::adorn_totals() %>% rbind(NA)}) %>%
mutate(
desc = ifelse(desc == "Total", NA, desc),
type = ifelse(type == "-", NA, type)
)
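For comparison, a base R sketch of the same split, total, and blank-row steps that also skips the total for single-row groups, as the question prefers. It assumes the blank separator rows have already been removed from the input; key, add_total and result are illustrative names.

```r
# Input without the blank separator rows (assumption for this sketch)
df <- data.frame(
  desc = c("a", "b", "b", "c", "c", "c", "d", "d", "d", "d"),
  type = c("z", "y", "y", "x", "x", "x", "x", "x", "y", "y"),
  xvalue = 16:25,
  yvalue = 1:10,
  stringsAsFactors = FALSE
)

# Split by desc/type while keeping the groups in order of first appearance
key <- paste(df$desc, df$type)
groups <- split(df, factor(key, levels = unique(key)))

# Append a totals row (only for groups with more than one row) and a blank row
add_total <- function(g) {
  total <- data.frame(desc = NA_character_, type = NA_character_,
                      xvalue = sum(g$xvalue), yvalue = sum(g$yvalue))
  blank <- data.frame(desc = NA_character_, type = NA_character_,
                      xvalue = NA_integer_, yvalue = NA_integer_)
  if (nrow(g) > 1) rbind(g, total, blank) else rbind(g, blank)
}

result <- do.call(rbind, lapply(groups, add_total))
rownames(result) <- NULL
result
```

The blank rows are represented as NA here; they can be replaced with empty strings when printing or exporting.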
I have a data frame that is sorted by a numeric column in order to assign a rank. Where that numeric column is zero, those rows should instead be arranged by another, character column.
The rank is based on var2, which is why I sorted by var2. If several rows have identical var2 values, I have to use var3 to break the tie; see rows 2 and 3 of the data frame below, where the var2 values are identical. If var2 is zero, I have to sort those rows by var1 (a character column) in alphabetical order and then assign the rank. If var2 is NA, no rank is assigned. Please refer to the data frame given below.
Below, the data frame is sorted by var2 in descending order, but var2 also contains zeros. The rows where var2 is zero need to be sorted alphabetically by var1, followed by the NA rows, also in alphabetical order of var1.
example:
# var1 var2 var3 rank
# 1 c 556 45 1
# 2 a 345 35 3
# 3 f 345 64 2
# 4 b 134 87 4
# 5 z 0 34 5
# 6 d 0 32 6
# 7 c 0 12 7
# 8 a 0 23 8
# 9 e NA
# 10 b NA
Below is my code:
df <- data.frame(var1=c("c","a","f","b","z","d", "c","a", "e", "b", "ad", "gf", "kg", "ts", "mp"), var2=c(134, NA,345, 200, 556,NA, 345, 200, 150, 0, 25,10,0,150,0), var3=c(65,'',45,34,68,'',73,12,35,23,34,56,56,78,123))
# To break the tie between var3 and var2
orderdf <- df[order(df$var2, df$var1, decreasing = TRUE), ]
#assigning rank
rankdf <- orderdf %>% mutate(rank = ifelse(is.na(var2),'', seq(1:nrow(orderdf))))
The expected output sorts var1 alphabetically for the rows where var2 is zero.
Expected output:
# var1 var2 var3 rank
# 1 c 556 45 1
# 2 a 345 35 3
# 3 f 345 64 2
# 4 b 134 87 4
# 5 a 0 34 5
# 6 c 0 32 6
# 7 d 0 12 7
# 8 z 0 23 8
# 9 b NA
# 10 e NA
With dplyr you can use
df %>%
arrange(desc(var2), var1)
and afterwards you create the column rank
EDIT
The following code is a bit cumbersome, but it gets the job done. It sorts the rows where var2 is non-zero separately from the rows where var2 is zero or NA, then binds the two ordered data frames together and finally creates the rank column.
Data
df <- data.frame(
var1 = c("c","a","f","b","z","d", "c","a", "e", "z", "ad", "gf", "kg", "ts", "mp"),
var2 = c(134, NA,345, 200, 556,NA, 345, 200, 150, 0, 25,10,0,150,0),
var3 = as.numeric(c(65,'',45,34,68,'',73,12,35,23,34,56,56,78,123))
)
df
# var1 var2 var3
# 1 c 134 65
# 2 a NA NA
# 3 f 345 45
# 4 b 200 34
# 5 z 556 68
# 6 d NA NA
# 7 c 345 73
# 8 a 200 12
# 9 e 150 35
# 10 z 0 23
# 11 ad 25 34
# 12 gf 10 56
# 13 kg 0 56
# 14 ts 150 78
# 15 mp 0 123
Code
df %>%
# work on rows with var2 different from 0 or NA
filter(var2 != 0) %>%
arrange(desc(var2), desc(var3)) %>%
# merge with rows with var2 equal to 0 or NA
bind_rows(df %>% filter(var2 == 0 | is.na(var2)) %>% arrange(var1)) %>%
arrange(desc(var2)) %>%
# create the rank column only for the rows with var2 different from NA
mutate(
rank = seq_len(nrow(df)),
rank = ifelse(is.na(var2), NA, rank)
)
Output
# var1 var2 var3 rank
# 1 z 556 68 1
# 2 c 345 73 2
# 3 f 345 45 3
# 4 b 200 34 4
# 5 a 200 12 5
# 6 ts 150 78 6
# 7 e 150 35 7
# 8 c 134 65 8
# 9 ad 25 34 9
# 10 gf 10 56 10
# 11 kg 0 56 11
# 12 mp 0 123 12
# 13 z 0 23 13
# 14 a NA NA NA
# 15 d NA NA NA
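The same ordering can also be sketched as a single call to base R's order() with a conditional tie-break key. This is a sketch, not a drop-in replacement: it assumes var3 is numeric and uses the data shown above; ord and ranked are illustrative names.

```r
df <- data.frame(
  var1 = c("c", "a", "f", "b", "z", "d", "c", "a", "e", "z", "ad", "gf", "kg", "ts", "mp"),
  var2 = c(134, NA, 345, 200, 556, NA, 345, 200, 150, 0, 25, 10, 0, 150, 0),
  var3 = c(65, NA, 45, 34, 68, NA, 73, 12, 35, 23, 34, 56, 56, 78, 123),
  stringsAsFactors = FALSE
)

ord <- order(
  -df$var2,                           # var2 descending; NA rows are placed last
  ifelse(df$var2 != 0, -df$var3, 0),  # var3 descending, only where var2 is non-zero
  df$var1                             # alphabetical where the earlier keys tie (zero and NA rows)
)
ranked <- df[ord, ]
rownames(ranked) <- NULL

# NA rows sort last, so sequential ranks can simply be blanked out for them
ranked$rank <- ifelse(is.na(ranked$var2), NA, seq_len(nrow(ranked)))
ranked
```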
Using only base R's order() function, sort by var2 in descending order and then by var1 in ascending order, and pass the resulting integer vector to square brackets to reorder the rows:
df[order(-df$var2, df$var1), ]
Adding a rank column too is then just
df[order(-df$var2, df$var1), "rank"] <- 1:length(df$var1)
Using data.table
library(data.table)
setDT(df)[order(-var2, var1)][, rank := seq_len(.N)][]
data
df <- structure(list(var1 = structure(c(3L, 1L, 6L, 2L, 7L, 4L, 3L,
1L, 5L, 2L), .Label = c("a", "b", "c", "d", "e", "f", "z"), class = "factor"),
var2 = c(1456L, 456L, 345L, 134L, 0L, 0L, 0L, 0L, NA, NA)),
class = "data.frame", row.names = c(NA, -10L))
You can do it in base R, using order:
cols <- c('var1', 'var2')
remaining_cols <- setdiff(names(df), cols)
df1 <- df[cols]
cbind(transform(df1[with(df1, order(-var2, var1)), ],
rank = seq_len(nrow(df1))), df[remaining_cols])
# var1 var2 rank var3
#1 c 556 1 45
#2 a 345 2 35
#3 f 345 3 64
#4 b 134 4 87
#8 a 0 5 34
#7 c 0 6 32
#6 d 0 7 12
#5 z 0 8 23
#10 b NA 9 10
#9 e NA 10 11
data
df <- structure(list(var1 = structure(c(3L, 1L, 6L, 2L, 7L, 4L, 3L,
1L, 5L, 2L), .Label = c("a", "b", "c", "d", "e", "f", "z"), class = "factor"),
var2 = c(556L, 345L, 345L, 134L, 0L, 0L, 0L, 0L, NA, NA),
var3 = c(45L, 35L, 64L, 87L, 34L, 32L, 12L, 23L, 10L, 11L
)), class = "data.frame", row.names = c(NA, -10L))
Below is part of my data:
df<-read.table(text=" K G M
12 2345 Gholi
KAM 2345 KAM
Noghl 1990 KAM
Zae 1990 441
12 2345 441
KAM 1990 12
Noghl 1800 12"
,header=TRUE)
I would like to create codes for K, G and M, starting at 1. K has 4 groups, so its codes are 1, 2, 3 and 4. G then starts at 5, giving 5, 6 and 7, as it has three subgroups.
Using the following code:
df = lapply(df, function(x) as.integer(as.factor(x)))
data.frame(Map("+", df, cumsum(c(0, head(sapply(df, max), -1)))))
I will get the following table:
KM KN KZ
1 7 10
2 7 11
3 6 11
4 6 9
1 7 9
2 6 8
3 5 8
Now I want to get the following table:
Group C
K,M 1,8
K,M 2,11
K 3
K 4
G 7
G 6
M 10
M 9
For example, the value 12 appears in column K and (twice) in column M, so its Group is K,M and its codes are 1 and 8, as coded in KM and KZ, and so on.
First convert every column to integer codes via factor, adding the maximum code of the previous column to each subsequent column. Then reshape to 'long' format with pivot_longer and bind the result to the original values, also reshaped to 'long' format (column 'origvalue'). Finally, group by 'origvalue' and paste together the unique elements of the other columns.
library(dplyr)
library(tidyr)
df %>%
mutate_all(~ as.integer(factor(.))) %>%
mutate(G = max(K) + G, M = max(G) + M) %>%
pivot_longer(everything()) %>%
bind_cols(df %>%
mutate_all(as.character) %>%
pivot_longer(everything(),values_to = 'origvalue') %>%
dplyr::select(-name)) %>%
group_by(origvalue) %>%
summarise_at(vars(-group_cols()), ~toString(unique(.))) %>%
dplyr::select(Group = name, C = value)
# A tibble: 9 x 2
# Group C
# <chr> <chr>
#1 K, M 1, 8
#2 G 5
#3 G 6
#4 G 7
#5 M 9
#6 M 10
#7 K, M 2, 11
#8 K 3
#9 K 4
data
df <- structure(list(K = structure(c(1L, 2L, 3L, 4L, 1L, 2L, 3L), .Label = c("12",
"KAM", "Noghl", "Zae"), class = "factor"), G = c(2345L, 2345L,
1990L, 1990L, 2345L, 1990L, 1800L), M = structure(c(3L, 4L, 4L,
2L, 2L, 1L, 1L), .Label = c("12", "441", "Gholi", "KAM"),
class = "factor")), class = "data.frame", row.names = c(NA,
-7L))
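The offset step and the final lookup can also be sketched in base R. This is a rough sketch under the assumption that every column can be treated as a set of labels; codes, offsets, coded, lookup and result are illustrative names.

```r
df <- data.frame(
  K = c("12", "KAM", "Noghl", "Zae", "12", "KAM", "Noghl"),
  G = c(2345, 2345, 1990, 1990, 2345, 1990, 1800),
  M = c("Gholi", "KAM", "KAM", "441", "441", "12", "12"),
  stringsAsFactors = FALSE
)

# Integer codes per column, each starting at 1
codes <- lapply(df, function(x) as.integer(factor(x)))

# Shift each column by the total number of codes used by the columns before it
offsets <- cumsum(c(0, head(sapply(codes, max), -1)))
coded <- as.data.frame(Map(`+`, codes, offsets))

# One row per distinct (column, original value, code) combination
lookup <- unique(data.frame(
  Group = rep(names(df), each = nrow(df)),
  value = unlist(lapply(df, as.character), use.names = FALSE),
  C = unlist(coded, use.names = FALSE),
  stringsAsFactors = FALSE
))

# Collapse columns and codes that share the same original value
result <- data.frame(
  Group = vapply(split(lookup$Group, lookup$value), toString, character(1)),
  C = vapply(split(lookup$C, lookup$value), toString, character(1)),
  stringsAsFactors = FALSE
)
result
```

The row names of result are the original values, so result["12", ] shows which columns and codes the value 12 maps to.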
I'm dealing with a large dataframe (over 100 columns) and I need to rename the columns. Let's say the dataframe of interest looks like this:
C D E F G H
1 10 200 50 40 60 10
2 30 400 20 30 30 10
3 20 40 30 30 50 10
I also have a "relational" dataframe with rows matching original column names to the desired new names that looks like this:
Code Name
1 C Cat
2 D Dog
3 E Emu
4 F Fish
5 G Goat
6 H Hog
What I'm looking for is a base or package function that allows me to use this match dataframe to rename the original columns, yielding a final dataframe that looks like this:
Cat Dog Emu Fish Goat Hog
1 10 200 50 40 60 10
2 30 400 20 30 30 10
3 20 40 30 30 50 10
Remember, the real application has something like 100+ columns, so the smallest amount of by hand coding possible is desirable here-- Thanks!
It can be done with rename_at (assuming that the columns 'Code' and 'Name' are of class character in the relational dataset)
library(dplyr)
df1 %>%
rename_at(vars(relational$Code), ~ relational$Name)
Or another option is setnames from data.table
library(data.table)
setDT(df1)
setnames(df1, relational$Code, relational$Name)
Or in base R
names(df1) <- setNames(relational$Name, relational$Code)[names(df1)]
data
df1 <- structure(list(C = c(10L, 30L, 20L), D = c(200L, 400L, 40L),
E = c(50L, 20L, 30L), F = c(40L, 30L, 30L), G = c(60L, 30L,
50L), H = c(10L, 10L, 10L)), class = "data.frame", row.names = c("1",
"2", "3"))
relational <- structure(list(Code = c("C", "D", "E", "F", "G", "H"), Name = c("Cat",
"Dog", "Emu", "Fish", "Goat", "Hog")), class = "data.frame",
row.names = c("1",
"2", "3", "4", "5", "6"))
We can use match, to match the column names with Code and get the corresponding Name.
names(df) <- relational$Name[match(names(df), relational$Code)]
df
# Cat Dog Emu Fish Goat Hog
#1 10 200 50 40 60 10
#2 30 400 20 30 30 10
#3 20 40 30 30 50 10
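If some columns have no entry in the lookup table, match returns NA for them and a direct assignment would turn those names into NA. A small sketch that renames only the matched columns; the column X and the unused code E are made up here to show the unmatched cases.

```r
# Toy data: column X has no entry in the lookup table (illustrative assumption)
df <- data.frame(C = 1:2, D = 3:4, X = 5:6)
relational <- data.frame(
  Code = c("C", "D", "E"),
  Name = c("Cat", "Dog", "Emu"),
  stringsAsFactors = FALSE
)

# Position of each column name in the lookup table (NA when there is no match)
idx <- match(names(df), relational$Code)
matched <- !is.na(idx)

# Rename only the columns that were found; the others keep their names
names(df)[matched] <- relational$Name[idx[matched]]
names(df)
```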
I have multiple files, each with many rows and three columns, and I need to merge them where the first two columns match.
File 1
12 13 a
13 15 b
14 17 c
4 9 d
. . .
. . .
81 23 h
File 2
12 13 e
3 10 b
14 17 c
4 9 j
. . .
. . .
1 2 k
File 3
12 13 m
13 15 k
1 7 x
24 9 d
. . .
. . .
1 2 h
and so on.
I want to merge them to obtain the following result
12 13 a e m
13 15 b k
14 17 c c
4 9 d j
3 10 b
24 9 d
. . .
. . .
81 23 h
1 2 k
1 7 x
The first thing that usually comes to mind with these types of problems is merge, perhaps in conjunction with a Reduce(function(x, y) merge(x, y, by = "somecols", all = TRUE), yourListOfDataFrames).
However, merge is not always the most efficient function, especially since it looks like you want to "collapse" all the values to fill in the rows from left to right, which would not be the default merge behavior.
Instead, I suggest you stack everything into one long data.frame and reshape it after you have added an index variable.
Here are two approaches:
Option 1: "dplyr" + "tidyr"
Use mget to put all of your data.frames into a list.
Use bind_rows to convert that list into a single data.frame (rbind_all, used in older versions of "dplyr", is no longer available).
Use sequence(n()) in mutate from "dplyr" to group the data and create an index.
Use spread from "tidyr" to transform from a "long" format to a "wide" format.
library(dplyr)
library(tidyr)
combined <- bind_rows(mget(ls(pattern = "^file\\d")))
combined %>%
group_by(V1, V2) %>%
mutate(time = sequence(n())) %>%
ungroup() %>%
spread(time, V3, fill = "")
# Source: local data frame [7 x 5]
#
# V1 V2 1 2 3
# 1 1 7 x
# 2 3 10 b
# 3 4 9 d j
# 4 12 13 a e m
# 5 13 15 b k
# 6 14 17 c c
# 7 24 9 d
Option 2: "data.table"
Use mget to put all of your data.frames into a list.
Use rbindlist to convert that list into a single data.table.
Use sequence(.N) to generate your sequence by your groups.
Use dcast to convert the "long" data.table into a "wide" one.
library(data.table)
dcast(
rbindlist(mget(ls(pattern = "^file\\d")))[,
time := sequence(.N), by = list(V1, V2)],
V1 + V2 ~ time, value.var = "V3", fill = "")
# V1 V2 1 2 3
# 1: 1 7 x
# 2: 3 10 b
# 3: 4 9 d j
# 4: 12 13 a e m
# 5: 13 15 b k
# 6: 14 17 c c
# 7: 24 9 d
Both of these answers assume we are starting with the following sample data:
file1 <- structure(
list(V1 = c(12L, 13L, 14L, 4L), V2 = c(13L, 15L, 17L, 9L),
V3 = c("a", "b", "c", "d")), .Names = c("V1", "V2", "V3"),
class = "data.frame", row.names = c(NA, -4L))
file2 <- structure(
list(V1 = c(12L, 3L, 14L, 4L), V2 = c(13L, 10L, 17L, 9L),
V3 = c("e", "b", "c", "j")), .Names = c("V1", "V2", "V3"),
class = "data.frame", row.names = c(NA, -4L))
file3 <- structure(
list(V1 = c(12L, 13L, 1L, 24L), V2 = c(13L, 15L, 7L, 9L),
V3 = c("m", "k", "x", "d")), .Names = c("V1", "V2", "V3"),
class = "data.frame", row.names = c(NA, -4L))
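For completeness, the same stack, index, reshape idea can be sketched with base R only, using ave to number the rows within each V1/V2 pair and reshape to go from long to wide. This is a sketch on the sample data above; combined and wide are illustrative names.

```r
file1 <- data.frame(V1 = c(12, 13, 14, 4), V2 = c(13, 15, 17, 9),
                    V3 = c("a", "b", "c", "d"), stringsAsFactors = FALSE)
file2 <- data.frame(V1 = c(12, 3, 14, 4), V2 = c(13, 10, 17, 9),
                    V3 = c("e", "b", "c", "j"), stringsAsFactors = FALSE)
file3 <- data.frame(V1 = c(12, 13, 1, 24), V2 = c(13, 15, 7, 9),
                    V3 = c("m", "k", "x", "d"), stringsAsFactors = FALSE)

# Stack all files into one long data frame
combined <- do.call(rbind, list(file1, file2, file3))

# Running index within each V1/V2 pair: 1 for the first file that has it, and so on
combined$time <- ave(seq_len(nrow(combined)), combined$V1, combined$V2,
                     FUN = seq_along)

# Long to wide: one row per V1/V2 pair, one V3.<index> column per occurrence
wide <- reshape(combined, idvar = c("V1", "V2"), timevar = "time",
                direction = "wide")
wide[is.na(wide)] <- ""   # show missing occurrences as empty strings
rownames(wide) <- NULL
wide
```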