I have a database as a data frame and I would like to order it by all columns, while keeping the relationship between the elements of each row.
For example, if I do the following:
> DF
A B C D
1 11 2 432 4
2 11 3 432 4
3 13 4 241 5
4 42 5 2 3
5 51 5 332 2
6 51 5 332 1
7 51 5 332 1
> DF=DF[order(A,B,C,D),]
> DF
A B C D
1 11 2 432 4
2 11 3 432 4
3 13 4 241 5
4 42 5 2 3
6 51 5 332 1
7 51 5 332 1
5 51 5 332 2
OK, this is what I wanted (pay attention to the last two rows), but I would like a generic solution, independent of the number of columns. I have tried the following, but it does not work.
> DF=DF[order(colnames(DF)),]
> DF
A B C D
1 11 2 432 4
2 11 3 432 4
3 13 4 241 5
4 42 5 2 3
I would be grateful if someone could help me with this little issue. Regards.
We can use do.call with order to order on all the columns of the dataset:
DF[do.call(order, DF),]
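Here do.call() just passes each column of DF to order() as a separate argument, so for the four-column example above it amounts to the same call as:
DF[order(DF$A, DF$B, DF$C, DF$D), ]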
If we use the tidyverse, there is arrange_at, which takes column names:
library(dplyr)
DF %>%
arrange_at(vars(names(.)))
#or as #Sotos commented
#arrange_all()
#or
#arrange(!!! rlang::syms(names(.)))
# A B C D
#1 11 2 432 4
#2 11 3 432 4
#3 13 4 241 5
#4 42 5 2 3
#5 51 5 332 1
#6 51 5 332 1
#7 51 5 332 2
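Note that in dplyr 1.0.0 and later the scoped verbs such as arrange_at are superseded; an equivalent call over all columns is:
DF %>%
  arrange(across(everything()))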
I have two dataframes of different sizes. Example:
t1 <- data.frame("id"=c(1,1,1,2,2,2,4,5,5,5,6,7,8),"condition"=c(3,3,1,5,5,5,10,10,5,5,2,3,1) )
t2 <- data.frame("ind"=c(1,2,4,5,6,7,8),"test_c"=c(3,5,10,10,2,3,1), "time"=c(32,55,21,34,55,22,19))
I would like to match the cases based on two criteria, t1$id == t2$ind and t1$condition == t2$test_c, and create an additional column in t1 with the value of t2$time where both conditions hold.
Expected outcome:
t3 <- data.frame("id"=c(1,1,1,2,2,2,4,5,5,5,6,7,8),"condition"=c(3,3,1,5,5,5,10,10,5,5,2,3,1), "time"=c(32,32,NA,55,55,55,21,34,NA,NA,55,22,19))
I suspect I should use the merge or match functions, but I am not sure which would be the right approach.
Base R
> out <- merge(t1, t2, by.x=c("id","condition"), by.y=c("ind","test_c"), all.x=TRUE)
> out
id condition time
1 1 1 NA
2 1 3 32
3 1 3 32
4 2 5 55
5 2 5 55
6 2 5 55
7 4 10 21
8 5 5 NA
9 5 5 NA
10 5 10 34
11 6 2 55
12 7 3 22
13 8 1 19
dplyr
library(dplyr)
left_join(t1, t2, by = c("id" = "ind", "condition" = "test_c"))
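On dplyr 1.1.0 or later, the same join can also be written with join_by():
left_join(t1, t2, by = join_by(id == ind, condition == test_c))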
Differences with your t3
There are some differences between them. For the sake of display, I'll show them side by side, sorted the same way so the comparison is easier.
cbind(out[with(out,order(id,condition)),], t3[with(t3,order(id,condition)),])
# id condition time id condition time
# 1 1 1 NA 1 1 NA
# 2 1 3 32 1 3 32
# 3 1 3 32 1 3 32
# 4 2 5 55 2 5 55
# 5 2 5 55 2 5 NA
# 6 2 5 55 2 5 NA
# 7 4 10 21 4 10 21
# 8 5 5 NA 5 5 NA
# 9 5 5 NA 5 5 NA
# 10 5 10 34 5 10 34
# 11 6 2 55 6 2 55
# 12 7 3 22 7 3 22
# 13 8 1 19 8 1 19
The only differences are with id=2, condition=5, where the merge assigns all of them the same time=55, while your t3 fills in only the first of them. I don't think this is a "first match only" logic, as there are other repeated id/condition combinations that are not treated that way. I suspect this is just a mistake in the sample data, or perhaps there is post-merge processing you haven't told us about yet :-)
If you want to use match, you can combine it with interaction (or paste) to match on multiple columns.
t1$time <- t2[match(interaction(t1), interaction(t2[-3])), 3]
t1
# id condition time
#1 1 3 32
#2 1 3 32
#3 1 1 NA
#4 2 5 55
#5 2 5 55
#6 2 5 55
#7 4 10 21
#8 5 10 34
#9 5 5 NA
#10 5 5 NA
#11 6 2 55
#12 7 3 22
#13 8 1 19
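A paste-based version of the same idea (a quick sketch; it assumes the pasted id/condition strings cannot collide) would be:
t1$time <- t2$time[match(paste(t1$id, t1$condition), paste(t2$ind, t2$test_c))]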
numbers1 <- c(4,23,4,23,5,43,54,56,657,67,67,435,
453,435,324,34,456,56,567,65,34,435)
and
numbers2 <- c(4,23,4,23,5,44,54,56,657,67,67,435,
453,435,324,34,456,56,567,65,34,435)
To perform the counting I do it manually:
as.data.frame(table(numbers1))
as.data.frame(table(numbers2))
But I can have 100 variables, from mydat$x1 to mydat$x100.
I don't want to enter this manually 100 times.
How can I do the counting for all of the variables?
as.data.frame(table(mydat$x1-mydat$x100))
is not working.
We can make a list of all the variables in the environment whose names match a pattern like numbers, and then loop through all of the elements of the list:
number_lst <- mget(ls(pattern = 'numbers\\d'), envir = .GlobalEnv) #thanks NelsonGon
lapply(number_lst, function(x) as.data.frame(table(x)))
$numbers1
x Freq
1 4 2
2 5 1
3 23 2
4 34 2
5 43 1
6 54 1
7 56 2
8 65 1
9 67 2
10 324 1
11 435 3
12 453 1
13 456 1
14 567 1
15 657 1
$numbers2
x Freq
1 4 2
2 5 1
3 23 2
4 34 2
5 44 1
6 54 1
7 56 2
8 65 1
9 67 2
10 324 1
11 435 3
12 453 1
13 456 1
14 567 1
15 657 1
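If other objects in your environment happen to contain "numbers" in their name, a slightly stricter pattern (an optional tweak) keeps them out of the list:
number_lst <- mget(ls(pattern = '^numbers\\d+$'), envir = .GlobalEnv)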
As I read your question, you want to count the number of times each unique element in a set occurs using minimal re-typing over many sets.
To do this, you'll first need to put the sets into a single object, e.g. into a list:
list_of_sets <- list(numbers1 = c(4,23,4,23,5,43,54,56,657,67,67,435,
453,435,324,34,456,56,567,65,34,435),
numbers2 = c(4,23,4,23,5,44,54,56,657,67,67,435,
453,435,324,34,456,56,567,65,34,435))
Then you loop over each list element, e.g. using a for loop:
list_of_counts <- list()
for(i in seq_along(list_of_sets)){
list_of_counts[[i]] <- as.data.frame(table(list_of_sets[[i]]))
}
list_of_counts then contains the results:
[[1]]
Var1 Freq
1 4 2
2 5 1
3 23 2
4 34 2
5 43 1
6 54 1
7 56 2
8 65 1
9 67 2
10 324 1
11 435 3
12 453 1
13 456 1
14 567 1
15 657 1
[[2]]
Var1 Freq
1 4 2
2 5 1
3 23 2
4 34 2
5 44 1
6 54 1
7 56 2
8 65 1
9 67 2
10 324 1
11 435 3
12 453 1
13 456 1
14 567 1
15 657 1
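If you also want the results to keep the set names (a small optional addition), copy them over afterwards:
names(list_of_counts) <- names(list_of_sets)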
I have run this SQL statement through the sqldf package:
SELECT A,B, COUNT(*) AS NUM
FROM DF
GROUP BY A,B
I have got the output I wanted, but I would like to keep the initial row order. Unfortunately, the output has a different order.
For example:
> DF
A B C D
1 11 2 432 4
2 11 3 432 4
3 13 4 241 5
4 42 5 2 3
5 51 5 332 2
6 51 5 332 1
7 51 5 332 1
> sqldf("SELECT A,B,C,D, COUNT (*) AS NUM
+ FROM DF
+ GROUP BY A,B,C,D")
A B C D NUM
1 11 2 432 4 1
2 11 3 432 4 1
3 13 4 241 5 1
4 42 5 2 3 1
5 51 5 332 1 2
6 51 5 332 2 1
As you can see, the row order changes (rows 5 and 6). It would be great if someone could help me with this issue.
Regards,
If we need to do this with sqldf, add a row-index column and use ORDER BY with the column names pasted together:
library(sqldf)
nm <- toString(names(DF))
DF1 <- cbind(rn = seq_len(nrow(DF)), DF)
nm1 <- toString(names(DF1))
fn$sqldf("SELECT $nm, COUNT (*) AS NUM
FROM DF1
GROUP BY $nm ORDER BY $nm1")
# A B C D NUM
#1 11 2 432 4 1
#2 11 3 432 4 1
#3 13 4 241 5 1
#4 42 5 2 3 1
#5 51 5 332 2 1
#6 51 5 332 1 2
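If you are not tied to sqldf, a dplyr sketch that keeps the first-appearance order of the rows (assuming a reasonably recent dplyr, which provides the name argument of add_count) could be:
library(dplyr)
DF %>%
  add_count(A, B, C, D, name = "NUM") %>%  # count duplicates without reordering rows
  distinct()                               # keep the first occurrence of each row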
Here is an example of my data frame:
df = read.table(text = 'a b
120 5
120 5
120 5
119 0
118 0
88 3
88 3
87 0
10 3
10 3
10 3
7 4
6 0
5 0
4 0', header = TRUE)
I need to replace the 0s in column b with the preceding value that is different from 0.
Here is my desired output:
a b
120 5
120 5
120 5
119 5
118 5
88 3
88 3
87 3
10 3
10 3
10 3
7 4
6 4
5 4
4 4
Until now I tried:
df$b[df$b == 0] = (df$b == 0) - 1
But it does not work.
Thanks
na.locf from zoo can help with this:
library(zoo)
#converting zeros to NA so that na.locf can get them
df$b[df$b == 0] <- NA
#using na.locf to replace NA with previous value
df$b <- na.locf(df$b)
Output:
> df
a b
1 120 5
2 120 5
3 120 5
4 119 5
5 118 5
6 88 3
7 88 3
8 87 3
9 10 3
10 10 3
11 10 3
12 7 4
13 6 4
14 5 4
15 4 4
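The two steps can also be combined into a single line on the original df, e.g.:
df$b <- na.locf(replace(df$b, df$b == 0, NA))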
Performing this task with a simple condition seems pretty hard, but you could also use a small for loop instead of loading a package.
for (i in which(df$b==0)) {
df$b[i] = df$b[i-1]
}
Output:
> df
a b
1 120 5
2 120 5
3 120 5
4 119 5
5 118 5
6 88 3
7 88 3
8 87 3
9 10 3
10 10 3
11 10 3
12 7 4
13 6 4
14 5 4
15 4 4
I assume that this could be slow for large data frames, though.
Here is a base R method using rle.
# get the run length encoding of variable
temp <- rle(df$b)
# fill in 0s with previous value
temp$values[temp$values == 0] <- temp$values[which(temp$values == 0) - 1]
# replace variable
df$b <- inverse.rle(temp)
This returns
df
a b
1 120 5
2 120 5
3 120 5
4 119 5
5 118 5
6 88 3
7 88 3
8 87 3
9 10 3
10 10 3
11 10 3
12 7 4
13 6 4
14 5 4
15 4 4
Note that the replacement line will not behave correctly if the first element of the vector is 0, since there is no previous value to carry forward. You can fix this by creating an index vector that excludes that position.
For example:
replacers <- which(temp$values == 0)
replacers <- replacers[replacers > 1]
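and then do the replacement with those indices:
temp$values[replacers] <- temp$values[replacers - 1]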
I want to sum two tables in R, but they have different valid categories, which produces two different dimensions. How can I add them up?
Example:
table(VA)
1 2 3 4 6 7 8 9 10
652 1 300 777 9 615 167 26 67
table(VB)
1 2 3 4 5 6 7 8 9 10
285 5 282 367 1 12 289 129 33 1118
table(VA)+table(VB)
Error in table(VA) + table(VB) : non-conformable arrays
What can I do to solve this?
I guess VA and VB are vectors. To effectively sum the tables, all you need to do is this:
table(c(VA,VB))
> VA <- sample(1:10,20,replace=TRUE)
> VB <- sample(1:10,20,replace=TRUE)
> table(VA)
VA
1 2 3 4 5 6 7 9 10
1 3 3 2 3 2 2 2 2
> table(VB)
VB
1 2 4 5 6 7 8 9 10
1 2 2 2 4 3 1 2 3
> table(c(VA,VB))
1 2 3 4 5 6 7 8 9 10
2 5 3 4 5 6 5 1 4 5
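If you really need to add two tables that were computed separately (rather than re-tabulating the combined vector), one sketch is to tabulate both over the union of the observed values first so the dimensions match:
lev <- sort(unique(c(VA, VB)))
table(factor(VA, levels = lev)) + table(factor(VB, levels = lev))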