Sort data frame by column of numbers - r

I am trying to sort a data frame by a column of numbers, but I get an alphanumeric sort of the digits instead. If the data frame is converted to a matrix, the sort works as expected.
df[order(as.numeric(df[,2])),]
V1 V2
1 a 1
3 c 10
2 b 2
4 d 3
> m <- as.matrix(df)
> m[order(as.numeric(m[,2])),]
V1 V2
[1,] "a" "1"
[2,] "b" "2"
[3,] "d" "3"
[4,] "c" "10"

V1 <- letters[1:4]
V2 <- as.character(c(1,10,2,3))
df <- data.frame(V1,V2, stringsAsFactors=FALSE)
df[order(as.numeric(df[,2])),]
gives
V1 V2
1 a 1
3 c 2
4 d 3
2 b 10
But
V1 <- letters[1:4]
V2 <- as.character(c(1,10,2,3))
df <- data.frame(V1,V2)
df[order(as.numeric(df[,2])),]
gives
V1 V2
1 a 1
2 b 10
3 c 2
4 d 3
The difference is due to factors.
Thanks to the commenters akrun and Imo. Inspect each of the two data frames with str(df).
There is also more detail in the factor() help page; scroll down to the 'Warning' section for more on the issue at hand.
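For example, comparing the two versions makes the problem visible. A short sketch, assuming the factor version of df built above (i.e. with the pre-R-4.0 stringsAsFactors = TRUE default):
str(df$V2)
# Factor w/ 4 levels "1","10","2","3": 1 2 3 4
as.numeric(df$V2)                  # factor codes, not the digits: 1 2 3 4
as.numeric(as.character(df$V2))    # the actual numbers: 1 10 2 3
df[order(as.numeric(as.character(df[,2]))),]   # sorts as intended
The Warning section of the factor() help page also recommends as.numeric(levels(f))[f] as a slightly more efficient equivalent of as.numeric(as.character(f)).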

Could you be a little more specific about what your initial data frame is?
Because by running this code:
df<-data.frame(c("a","b","c","d"),c(1,2,10,3))
colnames(df)<-c("V1","V2")
#print(df)
df.order<-df[order(as.numeric(df[,2])),]
print(df.order)
I get the right answer:
V1 V2
1 a 1
2 b 2
4 d 3
3 c 10

Edit:
The column values are probably being treated as factors.
Try converting to character and then to integer.
Example copied and pasted from the console:
> Foo <- data.frame('ABC' = c('a','b','c','d'),'123' = c('1','2','10','3'))
> Foo[order(as.integer(as.character(Foo[,2]))),]
ABC X123
1 a 1
2 b 2
4 d 3
3 c 10

Related

Change dataframe values R using different column name provided?

I have the following data frame:
Column1 Default_Val
1 A 2
2 B 2
3 C 2
4 D 2
5 E 2
...
colnames: "Column1" "Default_Val"
rownames: "1" "2" "3" "4" "5"
This data frame is part of a function of mine, and the function changes the default values according to some if conditions.
I want to generalize the assignment because I want to support different column names for this data frame.
Please advise how I can change the default value without depending on the column names.
Here is what I did so far:
df[Column1 == "A","Default_Val"]
[1] 2
df[Column1 == "A","Default_Val"] = 2
df[Column1 == "A","Default_Val"]
[1] 1
I want something generalized like:
t <- colnames(df)
df[t[1] == "A", t[2]] = 7
For some reason it doesn't work (each time this happens I love Python more :)).
Please advise.
I think it must be straightforward. Please check if this solves your problem.
> df
Column1 Default_val
1 A 1
2 B 3
3 A 4
4 C 1
5 D 4
> df[2][df[1] == 'A'] = 3
> df
Column1 Default_val
1 A 3
2 B 3
3 A 3
4 C 1
5 D 4
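For reference, here is a minimal sketch of the generalized, column-name-independent assignment the asker was after (assuming, as in the example, that the first column holds the keys and the second the values). The original attempt fails because t[1] == "A" compares the column name string itself to "A", not the column's values:
t <- colnames(df)
df[df[[t[1]]] == "A", t[2]] <- 7   # index rows by the column's values, not its name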

Getting columns with equivalent values in rows

From a data frame in which some columns are equivalent across all rows, I need to extract exactly those columns.
I have the following dataframe:
a <- c(1,2,3)
b <- c(2,2,3)
c <- c(4,5,6)
d <- c(1,2,3)
A <- data.frame(a,b,c,d)
> A
a b c d
1 1 2 4 1
2 2 2 5 2
3 3 3 6 3
I would like the following result:
> columnInnerJoin(A)
a d
1 1 1
2 2 2
3 3 3
Or, more specifically:
> columnInnerJoinGiveColumns(A)
a d
We can try with duplicated
res <- A[duplicated(as.list(A))|duplicated(as.list(A), fromLast=TRUE)]
names(res)
#[1] "a" "d"

finding pairs of duplicate columns in R

Thank you for viewing this post. I am a newbie to the R language.
I want to find whether any column (not a specified one) is a duplicate of another, and return a matrix with dimensions num.duplicates x 2, each row giving the two indices of a pair of duplicated variables. The matrix is organized so that the first column holds the lower index of each pair, in increasing order.
Let say I have a dataset
v1 v2 v3 v4 v5 v6
1 1 1 2 4 2 1
2 2 2 3 5 3 2
3 3 3 4 6 4 3
and I want this
[,1] [,2]
[1,] 1 2
[2,] 1 6
[3,] 2 6
[4,] 3 5
Please help, thank you!
Something like this I suppose:
out <- data.frame(t(combn(1:ncol(dd),2)))
out[combn(1:ncol(dd),2,FUN=function(x) all(dd[x[1]]==dd[x[2]])),]
# X1 X2
#1 1 2
#5 1 6
#9 2 6
#11 3 5
I feel like I'm missing something simpler, but this seems to work.
Here's the sample data.
dd <- data.frame(
v1 = 1:3, v2 = 1:3, v3 = 2:4,
v4 = 4:6, v5 = 2:4, v6 = 1:3
)
Now I'll assign each column to a group using ave() to look for duplicates, then count the number of columns in each group.
groups <- ave(1:ncol(dd), as.list(as.data.frame(t(dd))), FUN=min, drop=T)
Now that I have the groups, I'll split the column indexes up by those groups; if a group has more than one column, I'll grab all pairwise combinations. That creates a wide matrix, which I flip to the tall format you want with t().
morethanone <- function(x) length(x)>1
dups <- t(do.call(cbind,
lapply(Filter(morethanone, split(1:ncol(dd), groups)), combn, 2)
))
That returns
[,1] [,2]
[1,] 1 2
[2,] 1 6
[3,] 2 6
[4,] 3 5
as desired
First, generate all possible combinations with expand.grid. Second, remove duplicates and sort in the desired order. Third, use sapply to find the indexes of repeated columns:
kk <- expand.grid(1:ncol(df), 1:ncol(df))
nn <- kk[kk[, 1] > kk[, 2], 2:1]
nn[sapply(1:nrow(nn),
function(i) all(df[, nn[i, 1]] == df[, nn[i, 2]])), ]
Var2 Var1
2 1 2
6 1 6
12 2 6
17 3 5
The approach I propose is R-ish, but I suppose writing a simple double loop is justified in this case, especially if you have only recently started learning the language.
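For completeness, a double-loop sketch of the same idea (assuming the dd sample data from the first answer):
pairs <- NULL
for (i in 1:(ncol(dd) - 1)) {
  for (j in (i + 1):ncol(dd)) {
    # record the pair (i, j) if the two columns are identical
    if (all(dd[[i]] == dd[[j]])) pairs <- rbind(pairs, c(i, j))
  }
}
pairs
#      [,1] [,2]
# [1,]    1    2
# [2,]    1    6
# [3,]    2    6
# [4,]    3    5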

Check for unique elements

Just a simple question.
I have a data frame (only one column is shown) that looks like:
cln1
A
b
A
A
c
d
A
....
I would like the following output:
cln1
b
c
d
In other words, I would like to remove all items that are replicated. The functions unique and duplicated return output that still includes one copy of each replicated element; I would like to remove those elements entirely.
You can use setdiff for that:
R> v <- c(1,1,2,2,3,4,5)
R> setdiff(v, v[duplicated(v)])
[1] 3 4 5
You could use count from the plyr package to count the occurrences of each item, and delete all that occur more than once.
library(plyr)
l = c(1,2,3,3,4,5,6,6,7)
count_l = count(l)
x freq
1 1 1
2 2 1
3 3 2
4 4 1
5 5 1
6 6 2
7 7 1
l[!l %in% with(count_l, x[freq > 1])]
[1] 1 2 4 5 7
Note the !, which means NOT. You can of course put this in a one-liner:
l[!l %in% with(count(l), x[freq > 1])]
Another way using table:
With #juba's data:
as.numeric(names(which(table(v) == 1)))
# [1] 3 4 5
For the OP's data, since the output is character, as.numeric is not required.
names(which(table(v) == 1))
# [1] "b" "c" "d"

Find groups of duplicates in data frame by all columns except one

I have a large dataframe. For some purposes I need to do the following:
Select one column in this data frame
Iterate over all rows of the data frame, ignoring the selected column
Select all rows of the data frame that are equal in all elements except the selected column
Group them so that the group name is the row index and the group values are the indexes of the duplicated rows.
I have written a function for this task, but it runs slowly because of the nested loop. I would like some ideas on how this code can be improved.
Say we have a dataframe like this:
V1 V2 V3 V4
1 1 2 1 2
2 1 2 2 1
3 1 1 1 2
4 1 1 2 1
5 2 2 1 2
And we want to get this list as output:
diff.dataframe("V2", conf.new, conf.new)
Output:
$`1`
[1] 1
$`2`
[1] 2
$`3`
[1] 1 3
$`4`
[1] 2 4
$`5`
[1] 5
The following code reaches the goal, but it runs too slowly. Is it possible to improve it somehow?
diff.dataframe <- function(param, df1, df2){
  excl.names <- c(param)
  df1.excl <- data.frame(lapply(df1[, !names(df1) %in% excl.names], as.character), stringsAsFactors=FALSE)
  df2.excl <- data.frame(lapply(df2[, !names(df2) %in% excl.names], as.character), stringsAsFactors=FALSE)
  list.out <- list()
  for (i in 1:nrow(df1.excl)){
    for (j in 1:nrow(df2.excl)){
      if (paste(df1.excl[i,], collapse='') == paste(df2.excl[j,], collapse='')){
        if (!as.character(i) %in% unlist(list.out)){
          list.out[[as.character(i)]] <- c(list.out[[as.character(i)]], j)
        }
      }
    }
  }
  return(list.out)
}
Let's generate some data first
df <- as.data.frame(matrix(sample(2, 20, TRUE), 5))
# Produces df like this
V1 V2 V3 V4
1 2 1 1 1
2 2 1 2 2
3 1 1 2 2
4 1 2 1 1
5 1 2 1 1
We then loop over the rows with lapply. Each row i is compared to all rows of df with apply (including itself). Rows with <= 1 differences return TRUE, the others FALSE, producing a logical vector, which we convert to row indexes with which (unname just drops the row names so the output prints cleanly).
lapply(1:nrow(df), function(i)
  unname(which(apply(df, 1, function(x) sum(x != df[i,]) <= 1))))
# Produces output like this
[[1]]
[1] 1
[[2]]
[1] 2 3
[[3]]
[1] 2 3
[[4]]
[1] 4 5
[[5]]
[1] 4 5
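As a loop-free alternative, here is a sketch (assuming "duplicates" means rows identical on every column except the excluded one, and that the data is in df): build a key from the remaining columns and split the row indexes by it. Note that, unlike the output above, each group appears only once rather than once per matching row.
diff.dataframe2 <- function(param, df) {
  # paste the non-excluded columns into one key per row
  key <- do.call(paste, c(df[setdiff(names(df), param)], sep = "\r"))
  split(seq_len(nrow(df)), key)   # row indexes grouped by identical keys
}
diff.dataframe2("V2", df)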
