How to keep initial row order - R

I have run this SQL query through the sqldf package:
SELECT A,B, COUNT(*) AS NUM
FROM DF
GROUP BY A,B
I have got the output I wanted, but I would like to keep the initial row order. Unfortunately, the output has a different order.
For example:
> DF
A B C D
1 11 2 432 4
2 11 3 432 4
3 13 4 241 5
4 42 5 2 3
5 51 5 332 2
6 51 5 332 1
7 51 5 332 1
> sqldf("SELECT A,B,C,D, COUNT (*) AS NUM
+ FROM DF
+ GROUP BY A,B,C,D")
A B C D NUM
1 11 2 432 4 1
2 11 3 432 4 1
3 13 4 241 5 1
4 42 5 2 3 1
5 51 5 332 1 2
6 51 5 332 2 1
As you can see, the row order changes (rows 5 and 6). It would be great if someone could help me with this issue.
Regards,

If we need to do this with sqldf, use ORDER BY with the column names pasted together, after adding a row-number column to sort on:
library(sqldf)
nm <- toString(names(DF))
DF1 <- cbind(rn = seq_len(nrow(DF)), DF)
nm1 <- toString(names(DF1))
fn$sqldf("SELECT $nm, COUNT (*) AS NUM
FROM DF1
GROUP BY $nm ORDER BY $nm1")
# A B C D NUM
#1 11 2 432 4 1
#2 11 3 432 4 1
#3 13 4 241 5 1
#4 42 5 2 3 1
#5 51 5 332 2 1
#6 51 5 332 1 2
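For comparison, the same order-preserving count can be sketched in base R without sqldf (this is an alternative approach, not part of the answer above): ave() returns the group size for every row, and !duplicated() keeps each combination's first occurrence, so rows stay in their original order.

```r
# DF as in the question
DF <- data.frame(A = c(11, 11, 13, 42, 51, 51, 51),
                 B = c(2, 3, 4, 5, 5, 5, 5),
                 C = c(432, 432, 241, 2, 332, 332, 332),
                 D = c(4, 4, 5, 3, 2, 1, 1))
# ave() attaches each row's group size; keep only the first row per group
DF$NUM <- ave(seq_len(nrow(DF)), DF$A, DF$B, DF$C, DF$D, FUN = length)
out <- DF[!duplicated(DF[c("A", "B", "C", "D")]), ]
```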

Related

Merging two dataframes by keeping certain column values in r

I have two dataframes I need to merge with. The second one has certain columns missing and it also has some more ids. Here is how the sample datasets look like.
df1 <- data.frame(id = c(1,2,3,4,5,6),
item = c(11,22,33,44,55,66),
score = c(1,0,1,1,1,0),
cat.a = c("A","B","C","D","E","F"),
cat.b = c("a","a","b","b","c","f"))
> df1
id item score cat.a cat.b
1 1 11 1 A a
2 2 22 0 B a
3 3 33 1 C b
4 4 44 1 D b
5 5 55 1 E c
6 6 66 0 F f
df2 <- data.frame(id = c(1,2,3,4,5,6,7,8),
item = c(11,22,33,44,55,66,77,88),
score = c(1,0,1,1,1,0,1,1),
cat.a = c(NA,NA,NA,NA,NA,NA,NA,NA),
cat.b = c(NA,NA,NA,NA,NA,NA,NA,NA))
> df2
id item score cat.a cat.b
1 1 11 1 NA NA
2 2 22 0 NA NA
3 3 33 1 NA NA
4 4 44 1 NA NA
5 5 55 1 NA NA
6 6 66 0 NA NA
7 7 77 1 NA NA
8 8 88 1 NA NA
The two datasets share first 6 rows and dataset 2 has two more rows. When I merge I need to keep cat.a and cat.b information from the first dataframe. Then I also want to keep id=7 and id=8 with cat.a and cat.b columns missing.
Here is my desired output.
> df3
id item score cat.a cat.b
1 1 11 1 A a
2 2 22 0 B a
3 3 33 1 C b
4 4 44 1 D b
5 5 55 1 E c
6 6 66 0 F f
7 7 77 1 <NA> <NA>
8 8 88 1 <NA> <NA>
Any ideas?
Thanks!
We may use rows_update from dplyr:
library(dplyr)
rows_update(df2, df1, by = c("id", "item", "score"))
Output:
id item score cat.a cat.b
1 1 11 1 A a
2 2 22 0 B a
3 3 33 1 C b
4 4 44 1 D b
5 5 55 1 E c
6 6 66 0 F f
7 7 77 1 <NA> <NA>
8 8 88 1 <NA> <NA>
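If dplyr is not available, a rough base-R sketch of the same merge (with df1 and df2 as defined in the question) is to drop the all-NA columns from df2 and left-join against df1; ids 7 and 8 have no match in df1, so their cat columns come back as NA:

```r
df1 <- data.frame(id = 1:6, item = c(11, 22, 33, 44, 55, 66),
                  score = c(1, 0, 1, 1, 1, 0),
                  cat.a = c("A", "B", "C", "D", "E", "F"),
                  cat.b = c("a", "a", "b", "b", "c", "f"))
df2 <- data.frame(id = 1:8, item = c(11, 22, 33, 44, 55, 66, 77, 88),
                  score = c(1, 0, 1, 1, 1, 0, 1, 1),
                  cat.a = NA, cat.b = NA)
# Drop the empty cat.a / cat.b columns, then left-join on the shared keys
df3 <- merge(df2[c("id", "item", "score")], df1, all.x = TRUE)
df3 <- df3[order(df3$id), ]  # merge() may reorder; restore id order
```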

How to add a new column based on conditionnal difference between rows

I have a large dataset of patient IDs with delays in days between surgery and radiotherapy (RT) sessions. Some patients may have had two or three RT treatments. To identify those patients, I consider a delay greater than 91 days (3 months).
This delay of 91 days corresponds to the end of one RT treatment and the start of another. For analysis purposes it may be set at 61 days (2 months).
How can I treat a gap of more than 91 days between two values as a new RT treatment, and add the corresponding treatment order in a new column?
My database looks like this:
df1 <- data.frame (
id = c("a","a","a","a","b","b","b","b","b","b","b","b","b","b","b","b","b", "c","c","c","c"),
delay = c(2,3,5,6, 3,5,7,9, 190,195,201,203,205, 1299,1303,1306,1307, 200,202,204,205))
> df1
id delay
1 a 2
2 a 3
3 a 5
4 a 6
5 b 3
6 b 5
7 b 7
8 b 9
9 b 190
10 b 195
11 b 201
12 b 203
13 b 205
14 b 1299
15 b 1303
16 b 1306
17 b 1307
18 c 200
19 c 202
20 c 204
21 c 205
I failed to produce something like this, where a gap of more than 100 days between consecutive delays starts a new treatment.
df2 <- data.frame (
id = c("a","a","a","a","b","b","b","b","b","b","b","b","b","b","b","b","b", "c","c","c","c"),
delay = c(2,3,5,6, 3,5,7,9, 190,195,201,203,205, 1299,1303,1306,1307, 200,202,204,205),
tt_order = c("1st","1st","1st","1st"," 1st","1st","1st","1st"," 2nd","2nd","2nd","2nd","2nd"," 3rd","3rd","3rd","3rd"," 1st","1st","1st","1st"))
> df2
id delay tt_order
1 a 2 1st
2 a 3 1st
3 a 5 1st
4 a 6 1st
5 b 3 1st
6 b 5 1st
7 b 7 1st
8 b 9 1st
9 b 190 2nd
10 b 195 2nd
11 b 201 2nd
12 b 203 2nd
13 b 205 2nd
14 b 1299 3rd
15 b 1303 3rd
16 b 1306 3rd
17 b 1307 3rd
18 c 200 1st
19 c 202 1st
20 c 204 1st
21 c 205 1st
I will be grateful for any help you can provide.
One way would be to divide delay by 100 and then use match and unique to get a sequential index for each id. Note this assumes all delays of one treatment fall in the same hundreds block (a gap such as 195 to 201 would be split incorrectly); the output below was also generated with slightly different delay values than the df1/df2 shown above.
library(dplyr)
df2 %>%
group_by(id) %>%
mutate(n_tt = floor(delay/100),
n_tt = match(n_tt, unique(n_tt)))
# id delay tt_order n_tt
# <chr> <dbl> <dbl> <int>
# 1 a 2 1 1
# 2 a 3 1 1
# 3 a 5 1 1
# 4 a 6 1 1
# 5 b 3 1 1
# 6 b 5 1 1
# 7 b 7 1 1
# 8 b 9 1 1
# 9 b 150 2 2
#10 b 152 2 2
#11 b 155 2 2
#12 b 159 2 2
#13 b 1301 3 3
#14 b 1303 3 3
#15 b 1306 3 3
#16 b 1307 3 3
#17 c 200 1 1
#18 c 202 1 1
#19 c 204 1 1
#20 c 205 1 1
This creates a new column n_tt so it can be compared against tt_order in df2.
@CharlesLDN - perhaps this might be what you are looking for. It looks at differences in delay within each id, and gaps of more than 90 days are treated as a new treatment.
library(tidyverse)
df1 %>%
group_by(id) %>%
mutate(tt_order = cumsum(c(0, diff(delay)) > 90) + 1)
Output
id delay tt_order
<chr> <dbl> <dbl>
1 a 2 1
2 a 3 1
3 a 5 1
4 a 6 1
5 b 3 1
6 b 5 1
7 b 7 1
8 b 9 1
9 b 190 2
10 b 195 2
11 b 201 2
12 b 203 2
13 b 205 2
14 b 1299 3
15 b 1303 3
16 b 1306 3
17 b 1307 3
18 c 200 1
19 c 202 1
20 c 204 1
21 c 205 1
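The same gap-based rule can also be written in base R with ave, in case dplyr is not available (df1 as in the question):

```r
df1 <- data.frame(
  id = rep(c("a", "b", "c"), c(4, 13, 4)),
  delay = c(2, 3, 5, 6, 3, 5, 7, 9, 190, 195, 201, 203, 205,
            1299, 1303, 1306, 1307, 200, 202, 204, 205))
# Within each id, start a new treatment whenever the gap to the
# previous delay exceeds 90 days, and number the runs with cumsum()
df1$tt_order <- ave(df1$delay, df1$id,
                    FUN = function(x) cumsum(c(0, diff(x)) > 90) + 1)
```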

Calculate counts of observations for all variables at once in R

numbers1 <- c(4,23,4,23,5,43,54,56,657,67,67,435,
453,435,324,34,456,56,567,65,34,435)
and
numbers2 <- c(4,23,4,23,5,44,54,56,657,67,67,435,
453,435,324,34,456,56,567,65,34,435)
To perform the counting I do it manually:
as.data.frame(table(numbers1))
as.data.frame(table(numbers2))
but I can have 100 variables, from mydat$x1 to mydat$x100.
I don't want to enter them manually 100 times.
How can I do the counting for all variables at once?
as.data.frame(table(mydat$x1-mydat$x100))
is not working.
We can make a list of all variables in the environment that have a pattern like numbers. Then we can loop through all of the elements of the list:
number_lst <- mget(ls(pattern = 'numbers\\d'), envir = .GlobalEnv) #thanks NelsonGon
lapply(number_lst, function(x) as.data.frame(table(x)))
$numbers1
x Freq
1 4 2
2 5 1
3 23 2
4 34 2
5 43 1
6 54 1
7 56 2
8 65 1
9 67 2
10 324 1
11 435 3
12 453 1
13 456 1
14 567 1
15 657 1
$numbers2
x Freq
1 4 2
2 5 1
3 23 2
4 34 2
5 44 1
6 54 1
7 56 2
8 65 1
9 67 2
10 324 1
11 435 3
12 453 1
13 456 1
14 567 1
15 657 1
As I read your question, you want to count the number of times each unique element in a set occurs using minimal re-typing over many sets.
To do this, you'll first need to put the sets into a single object, e.g. into a list:
list_of_sets <- list(numbers1 = c(4,23,4,23,5,43,54,56,657,67,67,435,
453,435,324,34,456,56,567,65,34,435),
numbers2 = c(4,23,4,23,5,44,54,56,657,67,67,435,
453,435,324,34,456,56,567,65,34,435))
Then you loop over each list element, e.g. using a for loop:
list_of_counts <- list()
for(i in seq_along(list_of_sets)){
list_of_counts[[i]] <- as.data.frame(table(list_of_sets[[i]]))
}
list_of_counts then contains the results:
[[1]]
Var1 Freq
1 4 2
2 5 1
3 23 2
4 34 2
5 43 1
6 54 1
7 56 2
8 65 1
9 67 2
10 324 1
11 435 3
12 453 1
13 456 1
14 567 1
15 657 1
[[2]]
Var1 Freq
1 4 2
2 5 1
3 23 2
4 34 2
5 44 1
6 54 1
7 56 2
8 65 1
9 67 2
10 324 1
11 435 3
12 453 1
13 456 1
14 567 1
15 657 1
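If the 100 variables live in a data frame (mydat$x1 through mydat$x100, as in the question) rather than as loose vectors, the loop collapses to a single lapply over the columns, since a data frame is a list of columns. A small sketch with a hypothetical two-column mydat:

```r
# Hypothetical stand-in for mydat with just two columns
mydat <- data.frame(x1 = c(4, 23, 4, 23, 5),
                    x2 = c(4, 23, 4, 44, 5))
# lapply() visits each column in turn and tabulates its values
counts <- lapply(mydat, function(x) as.data.frame(table(x)))
```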

How to do a generic order [duplicate]

This question already has an answer here:
How to sort a matrix/data.frame by all columns
(1 answer)
Closed 5 years ago.
I have a database as a data frame and I would like to sort by all columns while keeping the relations between elements (i.e. keeping rows intact).
For example, if I do the following:
> DF
A B C D
1 11 2 432 4
2 11 3 432 4
3 13 4 241 5
4 42 5 2 3
5 51 5 332 2
6 51 5 332 1
7 51 5 332 1
> DF=DF[order(A,B,C,D),]
> DF
A B C D
1 11 2 432 4
2 11 3 432 4
3 13 4 241 5
4 42 5 2 3
6 51 5 332 1
7 51 5 332 1
5 51 5 332 2
Ok, this is what I wanted (pay attention to the last two rows), but I would like a generic solution, independent of the number of columns. I have tried the following, but it does not work.
> DF=DF[order(colnames(DF)),]
> DF
A B C D
1 11 2 432 4
2 11 3 432 4
3 13 4 241 5
4 42 5 2 3
I would be grateful if someone could help me with this little issue. Regards.
We can use do.call with order for ordering on all the columns of a dataset
DF[do.call(order, DF),]
If we use tidyverse, there is arrange_at that will take column names
library(dplyr)
DF %>%
arrange_at(vars(names(.)))
#or as #Sotos commented
#arrange_all()
#or
#arrange(!!! rlang::syms(names(.)))
# A B C D
#1 11 2 432 4
#2 11 3 432 4
#3 13 4 241 5
#4 42 5 2 3
#5 51 5 332 1
#6 51 5 332 1
#7 51 5 332 2
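Note that in current dplyr (1.0 and later) arrange_at is superseded; the across-based equivalent would be something along these lines:

```r
library(dplyr)

DF <- data.frame(A = c(11, 11, 13, 42, 51, 51, 51),
                 B = c(2, 3, 4, 5, 5, 5, 5),
                 C = c(432, 432, 241, 2, 332, 332, 332),
                 D = c(4, 4, 5, 3, 2, 1, 1))
# arrange(across(everything())) sorts by every column, left to right
out <- DF %>% arrange(across(everything()))
```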

Merge 2 dataframes based on condition in R

I have the following 2 data frames that I want to merge:
x <- data.frame(a= 1:11, b =3:13, c=2:12, d=7:17, invoice = 1:11)
> x
a b c d invoice
1 3 2 7 1
2 4 3 8 2
3 5 4 9 3
4 6 5 10 4
5 7 6 11 5
6 8 7 12 6
7 9 8 13 7
8 10 9 14 8
9 11 10 15 9
10 12 11 16 10
11 13 12 17 11
y <- data.frame(nr = 100:125, invoice = 1)
y$invoice[12:26] <- 2
> y
nr invoice
100 1
101 1
102 1
103 1
104 1
105 1
106 1
107 1
108 1
109 1
110 1
111 2
112 2
113 2
114 2
115 2
116 2
117 2
I want to merge the letter values from data frame x into data frame y where the invoice number matches. It should start with the value from column a, then b, etc., cycling through them until the invoice number changes, and then switch to the values for invoice 2.
The output should look like this:
> output
nr invoice letter_count
100 1 1
101 1 3
102 1 2
103 1 7
104 1 1
105 1 3
106 1 2
107 1 7
108 1 1
109 1 2
110 1 7
111 2 2
112 2 4
113 2 3
114 2 8
115 2 2
116 2 4
I tried to use the merge function with the by argument, but this produced an error because the number of rows is not the same. Any help is appreciated.
Here is a solution using the purrr package.
# Prepare the data frames
x <- data.frame(a = 1:11, b = 3:13, c = 2:12, d = 7:17, invoice = 1:11)
y <- data.frame(nr = 100:125, invoice = 1)
y$invoice[12:26] <- 2
# Load package
library(purrr)
# Split the data based on invoice
y_list <- split(y, f = y$invoice)
# Design a function to transfer data
trans_fun <- function(main_df, letter_df = x){
# Get the invoice number
temp_num<- unique(main_df$invoice)
# Extract letter_count information from x
add_vec <- unlist(letter_df[letter_df$invoice == temp_num, 1:4])
# Get the remainder of nrow(main_df) divided by length(add_vec)
remain_num <- nrow(main_df) %% length(add_vec)
# Get the integer quotient of nrow(main_df) and length(add_vec)
multiple_num <- nrow(main_df) %/% length(add_vec)
# Create the entire sequence to add
add_seq <- rep(add_vec, multiple_num + 1)
add_seq2 <- add_seq[1:(length(add_seq) - (length(add_vec) - remain_num))]
# Add new column, add_seq2, to main_df
main_df$letter_count <- add_seq2
return(main_df)
}
# Apply the trans_fun function using map_df
output <- map_df(y_list, .f = trans_fun)
# See the result
output
nr invoice letter_count
1 100 1 1
2 101 1 3
3 102 1 2
4 103 1 7
5 104 1 1
6 105 1 3
7 106 1 2
8 107 1 7
9 108 1 1
10 109 1 3
11 110 1 2
12 111 2 2
13 112 2 4
14 113 2 3
15 114 2 8
16 115 2 2
17 116 2 4
18 117 2 3
19 118 2 8
20 119 2 2
21 120 2 4
22 121 2 3
23 122 2 8
24 123 2 2
25 124 2 4
26 125 2 3
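A shorter sketch of the same idea in base R: split y by invoice and recycle the matching letter values down each group with rep_len (x and y as defined in the question):

```r
x <- data.frame(a = 1:11, b = 3:13, c = 2:12, d = 7:17, invoice = 1:11)
y <- data.frame(nr = 100:125, invoice = rep(c(1, 2), c(11, 15)))
# For each invoice group, pull that invoice's a:d values from x and
# recycle them to the group's length with rep_len()
y$letter_count <- unname(unlist(lapply(split(y, y$invoice), function(g) {
  vals <- unlist(x[x$invoice == unique(g$invoice), c("a", "b", "c", "d")])
  rep_len(vals, nrow(g))
})))
```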
