Sorting in dplyr produces incorrect output

arrange() in dplyr produces an incorrect result.
library(dplyr)
x <- as.data.frame(cbind(name=c("A","B","C","D"), val=c(0.032, 0.077, 0.4, 0.0001)))
x.1 <- x %>% arrange(val)
x.2 <- x %>% arrange(desc(val))
The outputs are:
> x
name val
1 A 0.032
2 B 0.077
3 C 0.4
4 D 1e-04
> x.1
name val
1 A 0.032
2 B 0.077
3 C 0.4
4 D 1e-04
> x.2
name val
1 D 1e-04
2 C 0.4
3 B 0.077
4 A 0.032
Both the ascending and the descending sort produce incorrect output.
I am not sure what I am doing wrong here.
Thank you.

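as.data.frame(cbind()) is what you are doing wrong there. cbind() builds a character matrix, so every value becomes a string, and as.data.frame() then converts those strings to factors (in R versions before 4.0.0, where stringsAsFactors defaulted to TRUE; from 4.0.0 onward they stay character). Either way, val is sorted as text rather than as numbers. Have a look ...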
str(x)
# 'data.frame': 4 obs. of 2 variables:
# $ name: Factor w/ 4 levels "A","B","C","D": 1 2 3 4
# $ val : Factor w/ 4 levels "0.032","0.077",..: 1 2 3 4
I don't know where people are learning this method of creating data frames, but it's terrible practice and should never be used.
Use data.frame() to create data frames, that's why it's there (or, when using dplyr, there is data_frame() as well, which has since been deprecated in favour of tibble()).
library(dplyr)
x <- data.frame(name=c("A","B","C","D"), val=c(0.032, 0.077, 0.4, 0.0001))
x.1 <- x %>% arrange(val)
x.2 <- x %>% arrange(desc(val))
x.1
# name val
# 1 D 0.0001
# 2 A 0.0320
# 3 B 0.0770
# 4 C 0.4000
x.2
# name val
# 1 C 0.4000
# 2 B 0.0770
# 3 A 0.0320
# 4 D 0.0001
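If the data frame has already been built the wrong way, the column can also be repaired in place. The following is a minimal base R sketch, assuming val ended up as a factor of numeric-looking strings as in the question; the intermediate as.character() matters because as.numeric() on a factor alone would return the internal level codes.

```r
# Assumption: val was stored as a factor of strings, as in the question.
x <- data.frame(name = c("A", "B", "C", "D"),
                val = factor(c("0.032", "0.077", "0.4", "1e-04")))

# Convert factor -> character -> numeric so the real values are recovered.
x$val <- as.numeric(as.character(x$val))

# Sorting now follows numeric order: D, A, B, C.
sorted <- x[order(x$val), ]
sorted
```

The same conversion can be done inside a dplyr pipeline with mutate() before calling arrange().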

Related

recode/replace multiple values in a shared data column to a single value across data frames

I hope I haven't missed it, but I haven't been able to find a working solution to this problem.
I have a set of data frames with a shared column. These columns contain multiple and varying transcription errors, some of which are shared, others not, for multiple values.
I would like to replace/recode the transcription errors (bad_values) with the correct values (good_values) across all data frames.
I have tried nesting the map*() family of functions across lists of data frames, bad_values, and good_values to do this, among other things. Here is an example:
df1 = data.frame(grp = c("a1","a.","a.",rep("b",7)), measure = rnorm(10))
df2 = data.frame(grp = c(rep("as", 3), "b2",rep("a",22)), measure = rnorm(26))
df3 = data.frame(grp = c(rep("b-",3),rep("bq",2),"a", rep("a.", 3)), measure = 1:9)
df_list = list(df1, df2, df3)
bad_values = list(c("a1","a.","as"), c("b2","b-","bq"))
good_values = list("a", "b")
dfs = map(df_list, function(x) {
x %>% mutate(grp = plyr::mapvalues(grp, bad_values, rep(good_values,length(bad_values))))
})
I didn't necessarily expect this to work beyond a single good-bad value pair. However, I thought nesting another call to map*() within it might work:
dfs = map(df_list, function(x) {
x %>% mutate(grp = map2(bad_values, good_values, function(x,y) {
recode(grp, bad_values = good_values)})
})
I have tried a number of other approaches, none of which have worked.
Ultimately, I would like to go from a set of data frames with errors, as here:
[[1]]
grp measure
1 a1 0.5582253
2 a. 0.3400904
3 a. -0.2200824
4 b -0.7287385
5 b -0.2128275
6 b 1.9030766
[[2]]
grp measure
1 as 1.6148772
2 as 0.1090853
3 as -1.3714180
4 b2 -0.1606979
5 a 1.1726395
6 a -0.3201150
[[3]]
grp measure
1 b- 1
2 b- 2
3 b- 3
4 bq 4
5 bq 5
6 a 6
To a list of 'fixed' data frames, as such:
[[1]]
grp measure
1 a -0.7671052
2 a 0.1781247
3 a -0.7565773
4 b -0.3606900
5 b 1.9264804
6 b 0.9506608
[[2]]
grp measure
1 a 1.45036125
2 a -2.16715639
3 a 0.80105611
4 b 0.24216723
5 a 1.33089426
6 a -0.08388404
[[3]]
grp measure
1 b 1
2 b 2
3 b 3
4 b 4
5 b 5
6 a 6
Any help would be very much appreciated.

Here is an option using tidyverse with recode_factor(). When there are multiple elements to be changed, create a named list of key/value pairs (the bad value as the name, the good value as the element) and splice it into recode_factor() with !!! to change the matched values to the new levels.
library(tidyverse)
keyval <- setNames(rep(good_values, lengths(bad_values)), unlist(bad_values))
out <- map(df_list, ~ .x %>%
mutate(grp = recode_factor(grp, !!! keyval)))
Output:
out
#[[1]]
# grp measure
#1 a -1.63295876
#2 a 0.03859976
#3 a -0.46541610
#4 b -0.72356671
#5 b -1.11552841
#6 b 0.99352861
#....
#[[2]]
# grp measure
#1 a 1.26536789
#2 a -0.48189740
#3 a 0.23041056
#4 b -1.01324689
#5 a -1.41586086
#6 a 0.59026463
#....
#[[3]]
# grp measure
#1 b 1
#2 b 2
#3 b 3
#4 b 4
#5 b 5
#6 a 6
#....
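NOTE: This doesn't change the class of the initial dataset column: grp was a factor in the original data frames (the data.frame() default before R 4.0.0), and recode_factor() returns a factor.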
str(out)
#List of 3
# $ :'data.frame': 10 obs. of 2 variables:
# ..$ grp : Factor w/ 2 levels "a","b": 1 1 1 2 2 2 2 2 2 2
# ..$ measure: num [1:10] -1.633 0.0386 -0.4654 -0.7236 -1.1155 ...
# $ :'data.frame': 26 obs. of 2 variables:
# ..$ grp : Factor w/ 2 levels "a","b": 1 1 1 2 1 1 1 1 1 1 ...
# ..$ measure: num [1:26] 1.265 -0.482 0.23 -1.013 -1.416 ...
# $ :'data.frame': 9 obs. of 2 variables:
# ..$ grp : Factor w/ 2 levels "a","b": 2 2 2 2 2 1 1 1 1
# ..$ measure: int [1:9] 1 2 3 4 5 6 7 8 9
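Once we have a keyval pair list, it can also be used with base R functions. Index by as.character(grp) so that the lookup goes by name rather than by factor code, and add the good values as their own keys so that already-correct entries are kept:
lookup <- c(unlist(keyval), setNames(unlist(good_values), unlist(good_values)))
out1 <- lapply(df_list, function(d) transform(d, grp = unname(lookup[as.character(grp)])))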

Any reason mapping a case_when statement wouldn't work?
library(tidyverse)
df_list %>%
map(~ mutate_if(.x, is.factor, as.character)) %>% # convert factor to character
map(~ mutate(.x, grp = case_when(grp %in% bad_values[[1]] ~ good_values[[1]],
grp %in% bad_values[[2]] ~ good_values[[2]],
TRUE ~ grp)))
I could see it working for your reprex but possibly not the greater problem.

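A base R option if you have a lot of good_values and bad_values and it is not possible to check each one individually.
lapply(df_list, function(x) {
  # work on a character copy so new values are not rejected as invalid factor levels
  vec = as.character(x[['grp']])
  # <<- modifies vec in the enclosing function, one bad/good pair at a time
  mapply(function(p, q) vec[vec %in% p] <<- q, bad_values, good_values)
  transform(x, grp = vec)
})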
#[[1]]
# grp measure
#1 a -0.648146527
#2 a -0.004722549
#3 a -0.943451194
#4 b -0.709509396
#5 b -0.719434286
#....
#[[2]]
# grp measure
#1 a 1.03131291
#2 a -0.85558910
#3 a -0.05933911
#4 b 0.67812934
#5 a 3.23854093
#6 a 1.31688645
#7 a 1.87464048
#8 a 0.90100179
#....
#[[3]]
# grp measure
#1 b 1
#2 b 2
#3 b 3
#4 b 4
#5 b 5
#....
Here, for every list element we extract its grp column, replace any bad_values found with the corresponding good_values, and return the corrected data frame.
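The answers above all amount to looking each value up in a table of bad spellings. As a rough illustration, here is a minimal base R sketch using a named lookup vector. The helper name recode_lookup and the two small data frames are hypothetical and made up for this example; they are not taken from the question or the answers.

```r
# Hypothetical helper: recode a vector through a named lookup vector.
recode_lookup <- function(x, bad_values, good_values) {
  # Names are the bad spellings, values are their replacements.
  lookup <- setNames(rep(unlist(good_values), lengths(bad_values)),
                     unlist(bad_values))
  x <- as.character(x)            # avoid indexing by factor codes
  hit <- x %in% names(lookup)     # only touch values that need fixing
  x[hit] <- unname(lookup[x[hit]])
  x
}

bad_values  <- list(c("a1", "a.", "as"), c("b2", "b-", "bq"))
good_values <- list("a", "b")

# Small illustrative inputs (assumed, not the question's random data).
df_list <- list(
  data.frame(grp = c("a1", "a.", "b"), measure = 1:3),
  data.frame(grp = c("b-", "bq", "a"), measure = 4:6)
)

fixed <- lapply(df_list, function(d) {
  d$grp <- recode_lookup(d$grp, bad_values, good_values)
  d
})
fixed
```

Values that do not appear in bad_values pass through unchanged, so the helper is safe to apply to data frames that contain only some of the errors.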

Reduce number of levels for each factor dplyr approach

I am trying to reduce the number of levels in each factor variable in my data. I want to reduce the number of levels with two operations:
1. If the number of levels is larger than a cut-off, replace the least frequent levels with a new level until the number of levels reaches the cut-off.
2. Replace levels that have too few observations with a new level.
I wrote a function that works fine, but I don't like the code. It does not matter if the level REMAIN itself has too few observations. I would prefer a dplyr approach.
library(data.table)  # the function uses := and therefore expects a data.table
ReplaceFactor <- function(data, max_levels, min_values_factor){
  # First make sure that there are not too many levels in a factor
  for(i in colnames(data)){
    if(is.factor(data[[i]])){
      if(length(levels(data[[i]])) > max_levels){
        # keep the (max_levels - 1) most frequent levels; the rest become "REMAIN"
        levels_keep <- names(sort(table(data[[i]]), decreasing = TRUE))[1 : (max_levels - 1)]
        data[!get(i) %in% levels_keep, (i) := "REMAIN"]
        data[[i]] <- as.factor(as.character(data[[i]]))
      }
    }
  }
  # Now make sure that each level has enough observations
  for(i in colnames(data)){
    if(is.factor(data[[i]])){
      if(min(table(data[[i]])) < min_values_factor){
        levels_replace <- table(data[[i]])[table(data[[i]]) < min_values_factor]
        data[get(i) %in% names(levels_replace), (i) := "REMAIN"]
        data[[i]] <- as.factor(as.character(data[[i]]))
      }
    }
  }
  return(data)
}
df <- data.frame(A = c("A","A","B","B","C","C","C","C","C"),
                 B = 1:9,
                 C = c("A","A","B","B","C","C","C","D","D"),
                 D = c("A","B","E", "E", "E","E","E", "E", "E"),
                 stringsAsFactors = TRUE)  # needed from R 4.0.0 on to get factor columns
str(df)
'data.frame': 9 obs. of 4 variables:
$ A: Factor w/ 3 levels "A","B","C": 1 1 2 2 3 3 3 3 3
$ B: int 1 2 3 4 5 6 7 8 9
$ C: Factor w/ 4 levels "A","B","C","D": 1 1 2 2 3 3 3 4 4
$ D: Factor w/ 3 levels "A","B","E": 1 2 3 3 3 3 3 3 3
dt2 <- ReplaceFactor(data = data.table(df),
max_levels = 3,
min_values_factor = 2)
str(dt2)
Classes ‘data.table’ and 'data.frame': 9 obs. of 4 variables:
$ A: Factor w/ 3 levels "A","B","C": 1 1 2 2 3 3 3 3 3
$ B: int 1 2 3 4 5 6 7 8 9
$ C: Factor w/ 3 levels "A","C","REMAIN": 1 1 3 3 2 2 2 3 3
$ D: Factor w/ 2 levels "E","REMAIN": 2 2 1 1 1 1 1 1 1
- attr(*, ".internal.selfref")=<externalptr>
dt2
A B C D
1: A 1 A REMAIN
2: A 2 A REMAIN
3: B 3 REMAIN E
4: B 4 REMAIN E
5: C 5 C E
6: C 6 C E
7: C 7 C E
8: C 8 REMAIN E
9: C 9 REMAIN E

Using forcats:
library(dplyr)
library(forcats)
max_levels <- 3
min_values_factor <- 2
df %>%
mutate_if(is.factor, fct_lump, n = max_levels,
other_level = "REMAIN", ties.method = "first") %>%
mutate_if(is.factor, fct_lump, prop = (min_values_factor - 1) / nrow(.),
other_level = "REMAIN")
# A B C D
# 1 A 1 A REMAIN
# 2 A 2 A REMAIN
# 3 B 3 B E
# 4 B 4 B E
# 5 C 5 C E
# 6 C 6 C E
# 7 C 7 C E
# 8 C 8 REMAIN E
# 9 C 9 REMAIN E
(Oh, and I wasn't able to replicate the exact behavior of your function, but you might get what you want by tweaking ties.method and subtracting 1 from max_levels.)
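For comparison, the two rules from the question can also be written as one small base R function applied per column. This is only a sketch: the name lump_factor and its argument names are made up here, and it resolves ties by whatever order sort() leaves them in, which may differ from forcats.

```r
# Hypothetical helper: apply both rules from the question to a single factor.
lump_factor <- function(f, max_levels, min_count, other = "REMAIN") {
  counts <- sort(table(f), decreasing = TRUE)
  keep <- names(counts)
  # Rule 1: too many levels -> keep only the (max_levels - 1) most frequent.
  if (length(keep) > max_levels) keep <- keep[seq_len(max_levels - 1)]
  # Rule 2: drop levels with fewer than min_count observations.
  keep <- keep[counts[keep] >= min_count]
  out <- as.character(f)
  out[!out %in% keep] <- other
  factor(out)
}

df <- data.frame(A = c("A","A","B","B","C","C","C","C","C"),
                 B = 1:9,
                 C = c("A","A","B","B","C","C","C","D","D"),
                 D = c("A","B","E","E","E","E","E","E","E"),
                 stringsAsFactors = TRUE)

# Apply to factor columns only; other columns are returned untouched.
df_lumped <- df
df_lumped[] <- lapply(df, function(col) {
  if (is.factor(col)) lump_factor(col, max_levels = 3, min_count = 2) else col
})
str(df_lumped)
```

The same function could be plugged into a dplyr pipeline through mutate() with across(where(is.factor), ...), if a dplyr style is preferred.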

Keep Empty Factor Levels During Aggregation [duplicate]

I'm using data.table to aggregate values, but I'm finding that when the "by" variable has a level not present in the aggregation, it is omitted, even if it is specified in the factor levels.
The code below generates a data.table with 6 rows, the last two of which only have one of the two possible levels for f2 nested within f1. During aggregation, the {3,1} combination is dropped.
library(data.table)
set.seed(1987)
dt <- data.table(f1 = factor(rep(1:3, each = 2)),
                 f2 = factor(sample(1:2, 6, replace = TRUE)),
                 val = runif(6))
str(dt)
Classes ‘data.table’ and 'data.frame': 6 obs. of 3 variables:
$ f1 : Factor w/ 3 levels "1","2","3": 1 1 2 2 3 3
$ f2 : Factor w/ 2 levels "1","2": 1 2 2 1 2 2
$ val: num 0.383 0.233 0.597 0.346 0.606 ...
- attr(*, ".internal.selfref")=<externalptr>
dt
f1 f2 val
1: 1 1 0.3829077
2: 1 2 0.2327311
3: 2 2 0.5965087
4: 2 1 0.3456710
5: 3 2 0.6058819
6: 3 2 0.7437177
dt[, sum(val), by = list(f1, f2)] # output is missing a row
f1 f2 V1
1: 1 1 0.3829077
2: 1 2 0.2327311
3: 2 2 0.5965087
4: 2 1 0.3456710
5: 3 2 1.3495996
# this is the output I'm looking for:
f1 f2 V1
1: 1 1 0.3829077
2: 1 2 0.2327311
3: 2 2 0.5965087
4: 2 1 0.3456710
5: 3 1 0.0000000 # <- the missing row from above
6: 3 2 1.3495996
Is there a way to achieve this behavior?

Why do you expect data.table to compute sums for all combinations of f1 and f2?
If this is what you want, you should add the missing rows to the original data before computing the grouped sum. For example:
setkey(dt, f1, f2)
# omit "by = .EACHI" in data.table <= 1.9.2
dt[CJ(levels(f1), levels(f2)), sum(val, na.rm=T),
allow.cartesian = T, by = .EACHI]
## f1 f2 V1
## 1: 1 1 0.3829077
## 2: 1 2 0.2327311
## 3: 2 1 0.3456710
## 4: 2 2 0.5965087
## 5: 3 1 0.0000000 ## <- your "missing row" :)
## 6: 3 2 1.3495996
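As a point of comparison outside data.table, base R's xtabs() tabulates over every combination of factor levels, so empty combinations appear with a sum of 0. The sketch below uses small made-up values rather than the question's random data.

```r
# Illustrative data (assumed values): the combination f1 = 3, f2 = 1 never occurs.
d <- data.frame(f1 = factor(c(1, 1, 2, 2, 3, 3)),
                f2 = factor(c(1, 2, 2, 1, 2, 2)),
                val = c(0.5, 0.25, 1, 2, 4, 8))

# xtabs() sums val over all level combinations; as.data.frame() turns the
# resulting table back into long format with one row per combination.
sums <- as.data.frame(xtabs(val ~ f1 + f2, data = d), responseName = "V1")
sums
```

The result has six rows, including one for f1 = 3, f2 = 1 with V1 equal to 0.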

Delete all rows corresponding to given ID

Data overview:
> str(dataStart[c("gvkey","DEF","FittedRob","NewCol")])
'data.frame': 1000 obs. of 4 variables:
$ gvkey : int 1004 1004 1004 1004 1004 1021 1021 1021 1021 1033 ...
$ DEF : int 0 0 0 0 0 0 0 0 0 0 1 0 0 0 1 0...
$ FittedRob : num 0.549 0.532 0.519 0.539 0.531 ...
$ NewCol : chr 0.549 "Del" 0.519 0.539 "Del2" ...
Now I would like to delete all rows of a given "gvkey" whenever "Del" or "Del2" occurs in any of its rows.
dataStart <- NewDataFrame[ ! NewDataFrame$NewCol %in% c("Del","Del2"),]
Here NewDataFrame is the data.frame containing NewCol. However, this deletes only the rows where "Del" or "Del2" occurs; I would like to delete every row of a "gvkey" if either "Del" or "Del2" occurs for it. Thanks.

You first have to select all the gvkey values you want to delete:
keys_to_delete <- unique(NewDataFrame$gvkey[NewDataFrame$NewCol %in%
c("Del","Del2")])
And then use these to delete the corresponding rows:
dataStart <- NewDataFrame[!(NewDataFrame$gvkey %in% keys_to_delete), ]
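To see the two steps working together, here is a small self-contained run. The six rows are made-up stand-ins for the real data, which is not shown in full in the question.

```r
# Illustrative data (assumed): gvkey 1004 contains "Del", 1033 contains "Del2".
NewDataFrame <- data.frame(
  gvkey  = c(1004, 1004, 1004, 1021, 1021, 1033),
  NewCol = c("0.549", "Del", "0.519", "0.539", "0.531", "Del2"),
  stringsAsFactors = FALSE
)

# Step 1: collect every gvkey that has at least one flagged row.
keys_to_delete <- unique(NewDataFrame$gvkey[NewDataFrame$NewCol %in%
                                              c("Del", "Del2")])

# Step 2: drop all rows belonging to those keys, flagged or not.
dataStart <- NewDataFrame[!(NewDataFrame$gvkey %in% keys_to_delete), ]
dataStart
```

Only the two rows for gvkey 1021 remain, because neither of them is flagged.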

set.seed(42)
DF <- data.frame(a = sample(c("a", "b", "c"), 10, T), b = sample(1:10, 10, T))
# a b
# 1 c 5
# 2 c 8
# 3 a 10
# 4 c 3
# 5 b 5
# 6 b 10
# 7 c 10
# 8 a 2
# 9 b 5
# 10 c 6
library(plyr)
res <- ddply(DF, .(a), transform, test = any(b %in% c(2, 3)))
res[!res$test, 1:2]
# a b
# 3 b 5
# 4 b 10
# 5 b 5

Use a bit of ave action, using the example data @Roland used:
DF[ave(DF$b,DF$a, FUN=function(x) !any(x %in% c(2,3)))==1,]
And an adaptation of Jan's nice answer:
DF[!DF$a %in% unique(DF$a[DF$b %in% c(2,3)]) ,]
Both giving:
a b
5 b 5
6 b 10
9 b 5

How do I drop a factor level with no observations? [duplicate]

I have a data frame with several variables that I'm running a mixed model on using lme(). One of the variables, ForAgeCat, has five factor levels: 1,2,3,4,5.
str(mvthab.3hr.fc$ForAgeCat)
>Factor w/ 5 levels "1","2","3","4",..: 5 5 5 5 5 5 5 5 5 5 ...
The problem is that factor level 3 actually doesn't exist, that is, in this dataset (which is a subset of a larger dataset) there are no observations from factor level 3, which I think is messing with my modeling in lme(). Can someone help me to remove/eliminate factor level 3 from the list of factor levels?

Use the function droplevels(), like so:
> DF$factor_var = droplevels(DF$factor_var)
More detail:
> # create a sample dataframe:
> col1 = runif(10)
> col1
[1] 0.6971600 0.1649196 0.5451907 0.9660817 0.8207766 0.9527764
0.9643410 0.2179709 0.9302741 0.4195046
> col2 = gl(n=2, k=5, labels=c("M", "F"))
> col2
[1] M M M M M F F F F F
Levels: M F
> DF = data.frame(Col1=col1, Col2=col2)
> DF
Col1 Col2
1 0.697 M
2 0.165 M
3 0.545 M
4 0.966 M
5 0.821 M
6 0.953 F
7 0.964 F
8 0.218 F
9 0.930 F
10 0.420 F
> # now filter DF so that only *one* factor value remains
> DF1 = DF[DF$Col2=="M",]
> DF1
Col1 Col2
1 0.697 M
2 0.165 M
3 0.545 M
4 0.966 M
5 0.821 M
> str(DF1)
'data.frame': 5 obs. of 2 variables:
$ Col1: num 0.697 0.165 0.545 0.966 0.821
$ Col2: Factor w/ 2 levels "M","F": 1 1 1 1 1
> # but still 2 factor *levels*, even though only one value
> DF1$Col2 = droplevels(DF1$Col2)
> # now Col2 has only a single level:
> str(DF1)
'data.frame': 5 obs. of 2 variables:
$ Col1: num 0.697 0.165 0.545 0.966 0.821
$ Col2: Factor w/ 1 level "M": 1 1 1 1 1
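droplevels() also accepts a whole data frame, which is convenient when several factor columns were subset at once. Below is a short sketch closer to the question's situation; the column name ForAgeCat comes from the question, while the values and the y column are made up.

```r
# Assumed data: level "3" is declared but has no observations.
f <- factor(c("1", "2", "4", "5"), levels = c("1", "2", "3", "4", "5"))
d <- data.frame(ForAgeCat = f, y = 1:4)

# Drop unused levels from every factor column of the data frame.
d2 <- droplevels(d)
levels(d2$ForAgeCat)

# Re-creating a single factor with factor() has the same effect.
levels(factor(f))
```

Both calls leave only the levels "1", "2", "4" and "5".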
