I have a data frame with a lot of company information separated by an id variable. I want to sort one of the variables and repeat it for every id. Let's take this example,
df <- structure(list(id = c(110, 110, 110, 90, 90, 90, 90, 252, 252
), var1 = c(26, 21, 54, 10, 18, 9, 16, 54, 39), var2 = c(234,
12, 43, 32, 21, 19, 16, 34, 44)), .Names = c("id", "var1", "var2"
), row.names = c(NA, -9L), class = "data.frame")
Which looks like this
df
id var1 var2
1 110 26 234
2 110 21 12
3 110 54 43
4 90 10 32
5 90 18 21
6 90 9 19
7 90 16 16
8 252 54 34
9 252 39 44
Now, I want to sort the data frame by var1 within each id. The easiest solution I can think of is using the apply function like this,
> apply(df, 2, sort)
id var1 var2
[1,] 90 9 12
[2,] 90 10 16
[3,] 90 16 19
[4,] 90 18 21
[5,] 110 21 32
[6,] 110 26 34
[7,] 110 39 43
[8,] 252 54 44
[9,] 252 54 234
However, this is not the output I am seeking. The correct output should be,
id var1 var2
1 110 21 12
2 110 26 234
3 110 54 43
4 90 9 19
5 90 10 32
6 90 16 16
7 90 18 21
8 252 39 44
9 252 54 34
That is, group by id, sort by var1 within each group, and keep the original order of the id groups.
Any idea how to sort like this?
Note: As mentioned by Moody_Mudskipper, there is no need to use tidyverse; it can also be done easily with base R:
df[order(ordered(df$id, unique(df$id)), df$var1), ]
A one-liner tidyverse solution w/o any temp vars:
library(tidyverse)
df %>% arrange(ordered(id, unique(id)), var1)
# id var1 var2
# 1 110 21 12
# 2 110 26 234
# 3 110 54 43
# 4 90 9 19
# 5 90 10 32
# 6 90 16 16
# 7 90 18 21
# 8 252 39 44
# 9 252 54 34
Explanation of why apply(df, 2, sort) does not work
What you were trying to do is to sort each column independently. apply runs over the specified dimension (2 in this case which corresponds to columns) and applies the function (sort in this case).
apply tries to further simplify the results, in this case to a matrix. So you are getting back a matrix (not a data.frame) where each column is sorted independently. For example this row from the apply call:
# [1,] 90 9 12
does not even exist in the original data.frame.
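To make that concrete, here is a minimal sketch that rebuilds the question's df and checks both claims: that apply returns a matrix, and that its first row is not a row of the original data. The helper name row_exists is made up for this illustration.

```r
# Rebuild the example data from the question
df <- data.frame(id   = c(110, 110, 110, 90, 90, 90, 90, 252, 252),
                 var1 = c(26, 21, 54, 10, 18, 9, 16, 54, 39),
                 var2 = c(234, 12, 43, 32, 21, 19, 16, 34, 44))

# Each column is sorted on its own, so values from different rows get mixed
m <- apply(df, 2, sort)
is.matrix(m)   # TRUE: the result is a matrix, not a data.frame

# Check whether the first row of the result occurs anywhere in df
row_exists <- any(df$id   == m[1, "id"] &
                  df$var1 == m[1, "var1"] &
                  df$var2 == m[1, "var2"])
row_exists     # FALSE
```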
Another base R option using order and match
df[with(df, order(match(id, unique(id)), var1, var2)), ]
# id var1 var2
#2 110 21 12
#1 110 26 234
#3 110 54 43
#6 90 9 19
#4 90 10 32
#7 90 16 16
#5 90 18 21
#9 252 39 44
#8 252 54 34
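The trick here is that match(id, unique(id)) turns each id into the position of its first appearance, which gives order a key that respects the original group order. A small sketch on the id column alone (the name grp is only for illustration):

```r
id <- c(110, 110, 110, 90, 90, 90, 90, 252, 252)

# unique(id) is 110, 90, 252, so each id maps to 1, 2 or 3
grp <- match(id, unique(id))
grp
# [1] 1 1 1 2 2 2 2 3 3
```

Ordering by grp first and var1 second therefore sorts within each id without moving the id groups themselves.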
We can convert the id to factor in order to split while preserving the original order. We can then loop over the list and order, and rbind again, i.e.
df$id <- factor(df$id, levels = unique(df$id))
do.call(rbind, lapply(split(df, df$id), function(i)i[order(i$var1),]))
# id var1 var2
#110.2 110 21 12
#110.1 110 26 234
#110.3 110 54 43
#90.6 90 9 19
#90.4 90 10 32
#90.7 90 16 16
#90.5 90 18 21
#252.9 252 39 44
#252.8 252 54 34
NOTE: You can reset the rownames by rownames(new_df) <- NULL
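For completeness, a sketch of the whole thing with the reset step included. It assumes the result of the rbind call is stored in new_df, a name that does not appear in the code above.

```r
df <- data.frame(id   = c(110, 110, 110, 90, 90, 90, 90, 252, 252),
                 var1 = c(26, 21, 54, 10, 18, 9, 16, 54, 39),
                 var2 = c(234, 12, 43, 32, 21, 19, 16, 34, 44))

# Factor levels in order of first appearance keep the groups in place
df$id <- factor(df$id, levels = unique(df$id))

# Sort each group by var1, then stack the pieces back together
new_df <- do.call(rbind,
                  lapply(split(df, df$id), function(g) g[order(g$var1), ]))

# rbind builds row names such as "110.2"; NULL resets them to 1..n
rownames(new_df) <- NULL
new_df
```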
In base R we could use split<- :
split(df,df$id) <- lapply(split(df,df$id), function(x) x[order(x$var1),] )
or as @Markus suggests :
split(df, df$id) <- by(df, df$id, function(x) x[order(x$var1),])
output in either case :
df
# id var1 var2
# 1 110 21 12
# 2 110 26 234
# 3 110 54 43
# 4 90 9 19
# 5 90 10 32
# 6 90 16 16
# 7 90 18 21
# 8 252 39 44
# 9 252 54 34
With the following tidyverse pipe, the question's output is reproduced.
library(tidyverse)
df %>%
mutate(tmp = cumsum(c(0, diff(id) != 0))) %>%
group_by(id) %>%
arrange(tmp, var1) %>%
select(-tmp)
## A tibble: 9 x 3
## Groups: id [3]
# id var1 var2
# <dbl> <dbl> <dbl>
#1 110 21 12
#2 110 26 234
#3 110 54 43
#4 90 9 19
#5 90 10 32
#6 90 16 16
#7 90 18 21
#8 252 39 44
#9 252 54 34
I am looking to full_join two datasets by an ID number, yet I am having the following issues:
Here is a snapshot of my data
ID= c(112,112,111,110)
code_2020= c(23,22,23,20)
df_2020 <- data.frame(ID,code_2020)
ID= c(112,111,110,109)
code_2021= c(23,23,23,23)
df_2021 <- data.frame(ID,code_2021)
I want to full_join both dataset but because in 2020 there is a person with ID as 112 appearing twice (once with code 23 and once with code 22), I get a repetition in the code_2021 with 112 and 23 appearing twice rather than once for 2021.
Here is what I get with my full join df_2020_2021<- full_join(df_2020,df_2021,by="ID")
Output I get:
ID code_2020 code_2021
112 23 23
112 22 23
111 23 23
110 20 23
109 NA 23
Output I want:
ID code_2020 code_2021
112 23 23
112 22 NA
111 23 23
110 20 23
109 NA 23
the difference is in the NA in row 2 (code_2021 of ID=112).
Edit: in terms of which 112 should be matched, I would be happy for whichever 112 to be matched (so the first 112 to be matched would work).
Number the duplicates and include that in the join:
library(dplyr)
df_2020 %>%
group_by(ID) %>%
mutate(occurrence = row_number()) %>%
ungroup() %>%
full_join(mutate(df_2021, occurrence = 1)) %>%
select(-occurrence)
# Joining, by = c("ID", "occurrence")
# # A tibble: 5 × 3
# ID code_2020 code_2021
# <dbl> <dbl> <dbl>
# 1 112 23 23
# 2 112 22 NA
# 3 111 23 23
# 4 110 20 23
# 5 109 NA 23
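The same idea can be sketched in base R, in case dplyr is not available. This is only an illustration of the occurrence-number technique, not the answer's own code: ave numbers the duplicates within each ID and merge does the full join. Like the dplyr version, it assumes each ID appears only once in df_2021. Note that merge sorts the result by the join keys, so the row order differs from the output above.

```r
df_2020 <- data.frame(ID = c(112, 112, 111, 110), code_2020 = c(23, 22, 23, 20))
df_2021 <- data.frame(ID = c(112, 111, 110, 109), code_2021 = c(23, 23, 23, 23))

# Number repeated IDs 1, 2, ... within each ID
df_2020$occurrence <- ave(df_2020$ID, df_2020$ID, FUN = seq_along)
# Assumption: IDs are unique in 2021, so every row is occurrence 1
df_2021$occurrence <- 1

# all = TRUE makes this a full join on both keys
joined <- merge(df_2020, df_2021, by = c("ID", "occurrence"), all = TRUE)
joined$occurrence <- NULL
joined
```

Only the first 112 row has a partner with occurrence 1 in df_2021, so the second 112 row gets NA for code_2021.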
We are looking to create a vector with the following sequence:
1,4,5,8,9,12,13,16,17,20,21,...
Start with 1, then skip 2 numbers, then add 2 numbers, then skip 2 numbers, etc., not going above 2000. We also need the inverse sequence 2,3,6,7,10,11,...
We may use a recycled logical vector to filter the sequence
(1:21)[c(TRUE, FALSE, FALSE, TRUE)]
[1] 1 4 5 8 9 12 13 16 17 20 21
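The logical vector is recycled along the whole index, so the same pattern works up to 2000, and flipping it gives the inverse sequence. A minimal sketch; the names keep and skip are made up here:

```r
# Pattern of length 4: keep positions 1 and 4, drop positions 2 and 3
keep <- (1:2000)[c(TRUE, FALSE, FALSE, TRUE)]
# Inverse pattern gives the numbers that were skipped
skip <- (1:2000)[c(FALSE, TRUE, TRUE, FALSE)]

head(keep, 6)
# [1]  1  4  5  8  9 12
head(skip, 6)
# [1]  2  3  6  7 10 11
```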
Here's an approach using rep and cumsum. Effectively, "add up alternating increments of 1 (successive #s) and 3 (skip two)."
cumsum(rep(c(1,3), 500))
and
cumsum(rep(c(3,1), 500)) - 1
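A quick sanity check of the two expressions above, as a sketch: 500 repetitions of the pair of increments give 1000 values each, the first sequence ends exactly at 2000, and together they cover 1:2000. The names a and b are only for illustration.

```r
a <- cumsum(rep(c(1, 3), 500))       # 1, 4, 5, 8, ...
b <- cumsum(rep(c(3, 1), 500)) - 1   # 2, 3, 6, 7, ...

head(a, 11)
# [1]  1  4  5  8  9 12 13 16 17 20 21
max(a)
# [1] 2000

# The two sequences are complementary on 1:2000
setequal(c(a, b), 1:2000)
# [1] TRUE
```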
Got this one myself - head(sort(c(seq(1, 2000, 4), seq(4, 2000, 4))), 20)
We can try like below
> (v <- seq(21))[v %% 4 %in% c(0, 1)]
[1] 1 4 5 8 9 12 13 16 17 20 21
You may arrange the data in a matrix and extract 1st and 4th column.
val <- 1:100
sort(c(matrix(val, ncol = 4, byrow = TRUE)[, c(1, 4)]))
# [1] 1 4 5 8 9 12 13 16 17 20 21 24 25 28 29 32 33
#[18] 36 37 40 41 44 45 48 49 52 53 56 57 60 61 64 65 68
#[35] 69 72 73 76 77 80 81 84 85 88 89 92 93 96 97 100
A tidyverse option.
library(purrr)
library(dplyr)
map_int(1:11, ~ case_when(. == 1 ~ as.integer(1),
. %% 2 == 0 ~ as.integer(.*2),
TRUE ~ as.integer((.*2)-1)))
# [1] 1 4 5 8 9 12 13 16 17 20 21
I'm currently using R and trying to add multiple columns to a DF, each holding the sum of the previous ones. Imagine I have a DF like this:
df=
sep-2016 oct-2016 nov-2016 dec-2016 jan-2017
1 70 153 NA 28 19
2 57 68 73 118 16
3 29 NA 19 32 36
4 177 36 3 54 53
and I want to append, for each month, the cumulative row sum up to that month. So for October you get the sum of sep and oct, and for November the sum of sep, oct and nov, ending up with something like this:
df=
sep-2016 oct-2016 nov-2016 dec-2016 jan-2017 status-Oct2016 status-Nov2016
1 70 153 NA 28 19 223 223
2 57 68 73 118 16 125 198
3 29 NA 19 32 36 29 48
4 177 36 3 54 53 213 216
I want to know an efficient way to do this instead of writing lots of lines of rowSums(), and if I could also get the label for each month on every iteration, that would be amazing!
Thanks!
We can use lapply to loop through the columns to apply the rowSums.
dat2 <- as.data.frame(lapply(2:ncol(dat), function(i){
rowSums(dat[, 1:i], na.rm = TRUE)
}))
names(dat2) <- paste0("status-", names(dat[, -1]))
dat3 <- cbind(dat, dat2)
dat3
# sep-2016 oct-2016 nov-2016 dec-2016 jan-2017 status-oct-2016 status-nov-2016 status-dec-2016 status-jan-2017
# 1 70 153 NA 28 19 223 223 251 270
# 2 57 68 73 118 16 125 198 316 332
# 3 29 NA 19 32 36 29 48 80 116
# 4 177 36 3 54 53 213 216 270 323
DATA
dat <- read.table(text = " 'sep-2016' 'oct-2016' 'nov-2016' 'dec-2016' 'jan-2017'
1 70 153 NA 28 19
2 57 68 73 118 16
3 29 NA 19 32 36
4 177 36 3 54 53",
header = TRUE, stringsAsFactors = FALSE)
names(dat) <- c("sep-2016", "oct-2016", "nov-2016", "dec-2016", "jan-2017")
Honestly I have no idea why you would want your data in this format, but here is a tidyverse method of accomplishing it. It involves transforming the data to a tidy format before spreading it back out into your wide format. The key thing to note is that in a tidy format, where month is a variable in a single column instead of spread across multiple columns, you can simply use group_by(rowid) and cumsum to calculate all the values you want. The last few lines are constructing the status- column names and spreading the data back out into a wide format.
library(tidyverse)
df <- read_table2(
"sep-2016 oct-2016 nov-2016 dec-2016 jan-2017
70 153 NA 28 19
57 68 73 118 16
29 NA 19 32 36
177 36 3 54 53"
)
df %>%
rowid_to_column() %>%
gather("month", "value", -rowid) %>%
arrange(rowid) %>%
group_by(rowid) %>%
mutate(
value = replace_na(value, 0),
status = cumsum(value)
) %>%
gather("vartype", "number", value, status) %>%
mutate(colname = ifelse(vartype == "value", month, str_c("status-", month))) %>%
select(rowid, number, colname) %>%
spread(colname, number)
#> # A tibble: 4 x 11
#> # Groups: rowid [4]
#> rowid `dec-2016` `jan-2017` `nov-2016` `oct-2016` `sep-2016`
#> <int> <dbl> <dbl> <dbl> <dbl> <dbl>
#> 1 1 28.0 19.0 0 153 70.0
#> 2 2 118 16.0 73.0 68.0 57.0
#> 3 3 32.0 36.0 19.0 0 29.0
#> 4 4 54.0 53.0 3.00 36.0 177
#> # ... with 5 more variables: `status-dec-2016` <dbl>,
#> # `status-jan-2017` <dbl>, `status-nov-2016` <dbl>,
#> # `status-oct-2016` <dbl>, `status-sep-2016` <dbl>
Created on 2018-02-16 by the reprex package (v0.2.0).
A clean way to do it is by converting your data to a long format.
library(tibble)
library(tidyr)
library(dplyr)
your_data <- tribble(~"sep_2016", ~"oct_2016", ~"nov_2016", ~"dec_2016", ~"jan_2017",
70, 153, NA, 28, 19,
57, 68, 73, 118, 16,
29, NA, 19, 32, 36,
177, 36, 3, 54, 53)
You can change the format of your data.frame with gather from the tidyr package.
your_data_long <- your_data %>%
rowid_to_column() %>%
gather(key = month_year, value = the_value, -rowid)
head(your_data_long)
#> # A tibble: 6 x 3
#> rowid month_year the_value
#> <int> <chr> <dbl>
#> 1 1 sep_2016 70
#> 2 2 sep_2016 57
#> 3 3 sep_2016 29
#> 4 4 sep_2016 177
#> 5 1 oct_2016 153
#> 6 2 oct_2016 68
Once your data.frame is in a long format, you can compute the cumulative sum with cumsum and the dplyr functions mutate and group_by.
result <- your_data_long %>%
group_by(rowid) %>%
mutate(cumulative_value = cumsum(the_value))
result
#> # A tibble: 20 x 4
#> # Groups: rowid [4]
#> rowid month_year the_value cumulative_value
#> <int> <chr> <dbl> <dbl>
#> 1 1 sep_2016 70 70
#> 2 2 sep_2016 57 57
#> 3 3 sep_2016 29 29
#> 4 4 sep_2016 177 177
#> 5 1 oct_2016 153 223
#> 6 2 oct_2016 68 125
#> 7 3 oct_2016 NA NA
#> 8 4 oct_2016 36 213
#> 9 1 nov_2016 NA NA
#> 10 2 nov_2016 73 198
#> 11 3 nov_2016 19 NA
#> 12 4 nov_2016 3 216
#> 13 1 dec_2016 28 NA
#> 14 2 dec_2016 118 316
#> 15 3 dec_2016 32 NA
#> 16 4 dec_2016 54 270
#> 17 1 jan_2017 19 NA
#> 18 2 jan_2017 16 332
#> 19 3 jan_2017 36 NA
#> 20 4 jan_2017 53 323
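Note that cumsum propagates NA, which is why rows with a missing month show NA from that month onward in the output above. If missing months should count as zero, as in the question's expected output, the NAs can be replaced first. A small base R sketch using the values of row 3 (the names x and filled are made up here):

```r
x <- c(29, NA, 19, 32, 36)   # row 3 of the example, sep-2016 to jan-2017

cumsum(x)
# [1] 29 NA NA NA NA

# Treat missing months as 0 before accumulating
filled <- replace(x, is.na(x), 0)
cumsum(filled)
# [1]  29  29  48  80 116
```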
If you want to retrieve the starting form, you can do it with spread.
My preferred solution would be:
# library(matrixStats)
DF <- as.matrix(df)
DF[is.na(DF)] <- 0
RES <- matrixStats::rowCumsums(DF)
colnames(RES) <- paste0("status-", colnames(DF))
cbind.data.frame(df, RES)
This is closest to what you are looking for with the rowSums.
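If matrixStats is not installed, roughly the same result can be sketched in base R with apply and cumsum. This is a swapped-in technique, not what the answer uses, and it will likely be slower on large data. The data frame is rebuilt here under the name dat so the block runs on its own.

```r
dat <- data.frame("sep-2016" = c(70, 57, 29, 177),
                  "oct-2016" = c(153, 68, NA, 36),
                  "nov-2016" = c(NA, 73, 19, 3),
                  "dec-2016" = c(28, 118, 32, 54),
                  "jan-2017" = c(19, 16, 36, 53),
                  check.names = FALSE)

m <- as.matrix(dat)
m[is.na(m)] <- 0                      # count missing months as 0

# apply over rows returns one column per row, so transpose back
res <- t(apply(m, 1, cumsum))
colnames(res) <- paste0("status-", colnames(m))

out <- cbind(dat, res)
out
```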
One option could be using the spread and gather functions from tidyverse.
Note: The status column has been added even for the 1st month. And the status columns are not in order but values are correct.
The approach is:
# Data
df <- read.table(text = "sep-2016 oct-2016 nov-2016 dec-2016 jan-2017
70 153 NA 28 19
57 68 73 118 16
29 NA 19 32 36
177 36 3 54 53", header = T, stringsAsFactors = F)
library(tidyverse)
# Just add a row number as sl
df <- df %>% mutate(sl = row_number())
#Calculate the cumulative sum after gathering and arranging by date
mod_df <- df %>%
gather(key, value, -sl) %>%
mutate(key = as.Date(paste("01",key, sep="."), format="%d.%b.%Y")) %>%
arrange(sl, key) %>%
group_by(sl) %>%
mutate(status = cumsum(ifelse(is.na(value),0L,value) )) %>%
select(-value) %>%
mutate(key = paste("status",as.character(key, format="%b.%Y"))) %>%
spread(key, status)
# Finally join cumulative calculated sum columns with original df and then
# remove sl column
inner_join(df, mod_df, by = "sl") %>% select(-sl)
# sep.2016 oct.2016 nov.2016 dec.2016 jan.2017 status Dec.2016 status Jan.2017 status Nov.2016 status Oct.2016 status Sep.2016
#1 70 153 NA 28 19 251 270 223 223 70
#2 57 68 73 118 16 316 332 198 125 57
#3 29 NA 19 32 36 80 116 48 29 29
#4 177 36 3 54 53 270 323 216 213 177
Another base solution where we build a matrix accumulating the row sums :
status <- setNames(
as.data.frame(t(apply(dat,1,function(x) Reduce(sum,'[<-'(x,is.na(x),0),accumulate = TRUE)))),
paste0("status-",names(dat)))
status
# status-sep-2016 status-oct-2016 status-nov-2016 status-dec-2016 status-jan-2017
# 1 70 223 223 251 270
# 2 57 125 198 316 332
# 3 29 29 48 80 116
# 4 177 213 216 270 323
Then bind it to your original data if needed :
cbind(dat,status[-1])
Var1 <- 90:115
Var2 <- 1:26
Var3 <- 52:27
data <- data.frame(Var1, Var2, Var3)
Hi, I want to select the 10 largest values from each column and save them in a new data frame. I know that in my example the new data frame will contain 20 rows, but I don't understand the correct workflow.
That's what I'm expecting:
Var1 Var2 Var3
90 1 52
91 2 51
92 3 50
93 4 49
94 5 48
95 6 47
96 7 46
97 8 45
98 9 44
99 10 43
106 17 36
107 18 35
108 19 34
109 20 33
110 21 32
111 22 31
112 23 30
113 24 29
114 25 28
115 26 27
I can solve my problem for three columns with this approach
df <- subset(data, Var1 >=106 | Var2 >=17 | Var3 >=43)
but if I have to do that for 50+ columns it's not really the best solution.
This can be done by looping over the columns with lapply, sorting each one, and getting the first 10 values with head
data.frame(lapply(data, function(x) head(sort(x,
decreasing=TRUE) ,10)))
If we need the first 10 rows, just use
head(data, 10)
Update
Based on the OP's edit
data[sort(Reduce(union,lapply(data, function(x)
order(x,decreasing=TRUE)[1:10]))),]
I think this is what you want:
data[sort(unique(c(sapply(data,order,decreasing=T)[1:10,]))),]
Basically, index the top 10 elements from each column, merge the indices and remove duplicates, reorder them, and extract those rows from the original data.
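Spelled out step by step, the one-liner can be sketched like this (the intermediate names idx, rows and res are made up for readability):

```r
data <- data.frame(Var1 = 90:115, Var2 = 1:26, Var3 = 52:27)

# Row positions of each column sorted from largest to smallest;
# keeping the first 10 rows gives a 10 x 3 matrix of positions
idx <- sapply(data, order, decreasing = TRUE)[1:10, ]

# Flatten, drop duplicates and restore the original row order
rows <- sort(unique(c(idx)))
rows
# [1]  1  2  3  4  5  6  7  8  9 10 17 18 19 20 21 22 23 24 25 26

res <- data[rows, ]
nrow(res)
# [1] 20
```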
A direct answer to your question:
nv1 <- sort(Var1,decreasing = TRUE)[1:10]
nv2 <- sort(Var2,decreasing = TRUE)[1:10]
nv3 <- sort(Var3,decreasing = TRUE)[1:10]
nd <- data.frame(nv1, nv2, nv3)
But why would you want to do such a thing? You're breaking the order of the data -- Var3 is decreasing and the others are increasing. Perhaps you want a list, rather than a data frame?
This might help:
thresh <- sapply(data,sort,decreasing=T)[10,]
data[!!rowSums(sapply(1:ncol(data),function(x) data[,x]>=thresh[x])),]
First, a vector thresh is defined, which contains the tenth largest value of each column. Then we perform a loop over the columns to check if any of the values is larger than or equal to the corresponding threshold value. The !! is a shorthand notation for as.logical(), which (owing to the combination with rowSums) selects those rows where at least one of the values is above or equal to the threshold. In your example this yields the output:
# Var1 Var2 Var3
#1 90 1 52
#2 91 2 51
#3 92 3 50
#4 93 4 49
#5 94 5 48
#6 95 6 47
#7 96 7 46
#8 97 8 45
#9 98 9 44
#10 99 10 43
#17 106 17 36
#18 107 18 35
#19 108 19 34
#20 109 20 33
#21 110 21 32
#22 111 22 31
#23 112 23 30
#24 113 24 29
#25 114 25 28
#26 115 26 27
Which is equal to the output that you obtain with the command you posted:
#> identical(data[!!rowSums(sapply(1:ncol(data),function(x) data[,x]>=thresh[x])),], subset(data, Var1 >=106 | Var2 >=17 | Var3 >=43))
[1] TRUE
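A slightly more explicit variant of the same threshold idea, as a sketch: mapply pairs each column with its own threshold, and rowSums(...) > 0 plays the role of the !! shorthand. The names hits and keep are made up here.

```r
data <- data.frame(Var1 = 90:115, Var2 = 1:26, Var3 = 52:27)

# Tenth largest value of each column
thresh <- sapply(data, function(x) sort(x, decreasing = TRUE)[10])

# Logical matrix: TRUE where a value reaches its column's threshold
hits <- mapply(function(col, t) col >= t, data, thresh)

# Keep rows with at least one hit
keep <- rowSums(hits) > 0
res <- data[keep, ]
nrow(res)
# [1] 20
```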
Ok, first of all let me generate some sample data:
A_X01 <- c(34, 65, 23, 43, 22)
A_X02 <- c(2, 4, 7, 8, 3)
B_X01 <- c(24, 45, 94, 23, 54)
B_X02 <- c(4, 2, 4, 9, 1)
C_X01 <- c(34, 65, 876, 45, 87)
C_X02 <- c(123, 543, 86, 87, 34)
Var <- c(3, 5, 7, 2, 3)
DF <- data.frame(A_X01, A_X02, B_X01, B_X02, C_X01, C_X02, Var)
What I want to do is apply an equation to the concurrent columns of A and B for both X01 and X02, with a third column "Var" used in the equation.
So far I have been doing this the following way:
DF$D_X01 <- (DF$A_X01 + DF$B_X01) * DF$Var
DF$D_X02 <- (DF$A_X02 + DF$B_X02) * DF$Var
My desired output is as follows:
A_X01 A_X02 B_X01 B_X02 C_X01 C_X02 Var D_X01 D_X02
1 34 2 24 4 34 123 3 174 18
2 65 4 45 2 65 543 5 550 30
3 23 7 94 4 876 86 7 819 77
4 43 8 23 9 45 87 2 132 34
5 22 3 54 1 87 34 3 228 12
As you'll appreciate, this is a lot of lines of code to do something fairly simple, so at present my scripts are rather long (I have multiple columns in the actual dataset)!
One of the apply functions must be the way to go but I can't seem to get my head around it for concurrent columns. I did think about using lapply but how would I get this to work for the two lists of columns and for the right columns to be added together?
I've looked around and can't seem to find a way to do this which must be a fairly common problem?
Thanks.
EDIT:
Original question was a bit confusing so have updated with a desired output and some extra conditions.
Try this
indx <- gsub("\\D", "", grep("A_X|B_X", names(DF), value = TRUE)) # Retrieving indexes
indx2 <- DF[grep("A_X|B_X", names(DF))] # Considering only the columns of interest
DF[paste0("D_X", unique(indx))] <-
sapply(unique(indx), function(x) rowSums(indx2[which(indx == x)])*DF$Var)
DF
# A_X01 A_X02 B_X01 B_X02 C_X01 C_X02 Var D_X01 D_X02
# 1 34 2 24 4 34 123 3 174 18
# 2 65 4 45 2 65 543 5 550 30
# 3 23 7 94 4 876 86 7 819 77
# 4 43 8 23 9 45 87 2 132 34
# 5 22 3 54 1 87 34 3 228 12
You may also try
indxA <- grep("^A", colnames(DF))
indxB <- grep("^B", colnames(DF))
f1 <- function(x,y,z) (x+y)*z
DF[sprintf('D_X%02d', indxA)] <- Map(f1 , DF[indxA], DF[indxB], list(DF$Var))
DF
# A_X01 A_X02 B_X01 B_X02 C_X01 C_X02 Var D_X01 D_X02
#1 34 2 24 4 34 123 3 174 18
#2 65 4 45 2 65 543 5 550 30
#3 23 7 94 4 876 86 7 819 77
#4 43 8 23 9 45 87 2 132 34
#5 22 3 54 1 87 34 3 228 12
Or you could use mapply
DF[sprintf('D_X%02d', indxA)] <- mapply(`+`, DF[indxA],DF[indxB])*DF$Var
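One thing to watch in the last two snippets: sprintf('D_X%02d', indxA) builds the new names from column positions, which works here only because the A columns happen to sit in positions 1 and 2. A sketch of a variant that derives the B and D names from the A names instead; it assumes every A_ column has a B_ column with the same suffix, and the names a_cols, b_cols and d_cols are made up here.

```r
DF <- data.frame(A_X01 = c(34, 65, 23, 43, 22), A_X02 = c(2, 4, 7, 8, 3),
                 B_X01 = c(24, 45, 94, 23, 54), B_X02 = c(4, 2, 4, 9, 1),
                 C_X01 = c(34, 65, 876, 45, 87), C_X02 = c(123, 543, 86, 87, 34),
                 Var = c(3, 5, 7, 2, 3))

# Pair columns by suffix rather than by position
a_cols <- grep("^A_", names(DF), value = TRUE)
b_cols <- sub("^A_", "B_", a_cols)
d_cols <- sub("^A_", "D_", a_cols)

# Apply the equation to each A/B pair
DF[d_cols] <- Map(function(a, b) (a + b) * DF$Var, DF[a_cols], DF[b_cols])
DF
```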