I noticed an unexpected outcome while using dplyr::arrange together with gtools::mixedorder.
Consider:
library(tidyverse)
test <- data.frame(V1 = c("all13_LG1", "all13_LG10", "all13_LG11",
                          "all13_LG12", "all13_LG13", "all13_LG14", "all13_LG15", "all13_LG16",
                          "all13_LG2", "all13_LG3", "all13_LG4", "all13_LG5", "all13_LG6",
                          "all13_LG7", "all13_LG8", "all13_LG9"),
                   V2 = 1:16)
test2 <- test %>% arrange(gtools::mixedorder(V1))
test3 <- test %>% slice(gtools::mixedorder(V1))
In test2, the first column comes out in this order: "all13_LG1", "all13_LG3", "all13_LG4", "all13_LG5", "all13_LG6", "all13_LG7", "all13_LG8", "all13_LG9", "all13_LG10", "all13_LG11", "all13_LG12", "all13_LG13", "all13_LG14", "all13_LG15", "all13_LG16", "all13_LG2"
In test3, however, the rows are sorted as one would expect when using gtools::mixedorder.
Why does this happen when I combine arrange and mixedorder? Is this a bug?
Many thanks,
Anneke
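arrange() treats its argument as a sort key for each row, not as a vector of row positions. mixedorder returns positions, so to use its result in arrange you need to convert it to ranks by wrapping it in order():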
library(dplyr)
test %>% arrange(order(gtools::mixedorder(V1)))
# V1 V2
#1 all13_LG1 1
#2 all13_LG2 9
#3 all13_LG3 10
#4 all13_LG4 11
#5 all13_LG5 12
#6 all13_LG6 13
#7 all13_LG7 14
#8 all13_LG8 15
#9 all13_LG9 16
#10 all13_LG10 2
#11 all13_LG11 3
#12 all13_LG12 4
#13 all13_LG13 5
#14 all13_LG14 6
#15 all13_LG15 7
#16 all13_LG16 8
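Why the extra order() helps can be seen in base R on a small vector. The sketch below uses plain order() on made-up numeric values instead of mixedorder, on the assumption that both return a permutation of row positions; the variable names are only for illustration.

```r
x <- c(30, 10, 20)

perm  <- order(x)     # positions that sort x: 2 3 1
ranks <- order(perm)  # rank of each element of x: 3 1 2

# Indexing with the permutation sorts x (this is what slice() does)
x[perm]               # 10 20 30

# Sorting by the ranks as a key also sorts x (this is what arrange() needs)
x[order(ranks)]       # 10 20 30

# Sorting by the permutation itself as a key generally does not
x[order(perm)]        # 20 30 10
```

In other words, slice() wants "which row goes where", while arrange() wants "how large is each row's key".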
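Alternatively, slice with mixedsort and match. This assumes the values of V1 are unique, since match returns only the first position of each value:
library(dplyr)
test %>%
  slice(match(gtools::mixedsort(V1), V1))
Output: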
V1 V2
1 all13_LG1 1
2 all13_LG2 9
3 all13_LG3 10
4 all13_LG4 11
5 all13_LG5 12
6 all13_LG6 13
7 all13_LG7 14
8 all13_LG8 15
9 all13_LG9 16
10 all13_LG10 2
11 all13_LG11 3
12 all13_LG12 4
13 all13_LG13 5
14 all13_LG14 6
15 all13_LG15 7
16 all13_LG16 8
I'd like to use pmax to create a new column. My code looks like this:
Master <- Master %>%
mutate(RAM = pmax(RAM1, RAM2, RAM3, RAM4, RAM5, RAM6, RAM7, RAM8, RAM9, RAM10,
RAM11, RAM12, RAM13, RAM14, RAM15, RAM16, RAM17, RAM18,
RAM19, RAM20, RAM21, RAM22, RAM23, RAM24, RAM25, RAM26,
RAM27, RAM28, RAM29, RAM30, RAM31, RAM32, RAM33, RAM34,
RAM35, RAM36, RAM37, RAM38, RAM39, RAM40, RAM41, RAM42,
RAM43, RAM44, RAM45, RAM46, RAM47, RAM48, RAM49, RAM50,
RAM51, RAM52, RAM53, RAM54, RAM55, RAM56, RAM57, RAM58,
RAM59, RAM60, RAM61, RAM62, RAM63, RAM64, RAM65, RAM66,
RAM67, RAM68, RAM69, RAM70, RAM71, RAM72, RAM73, RAM74,
RAM75, RAM76, RAM77, RAM78, RAM79, RAM80, RAM81, RAM82,
RAM83, RAM84, RAM85, RAM86, RAM87, RAM88, RAM89, RAM90,
RAM91, RAM92, na.rm = TRUE))
In my current data set, however, only the columns RAM1 to RAM8 exist. In that case I want R to skip all the other columns named in the statement and use only RAM1 to RAM8 (it is fine if R prints a message, but I don't want execution to stop).
Any ideas how to do so?
Thanks!
One way to do this is as follows.
First, set up some data to make a reproducible example:
set.seed(0)
Master <- data.frame(Other=100,RAM1=1:10, RAM2=1:10, RAM3=1:10, RAM4=1:10,
RAM5=1:10, RAM6=1:10, RAM7=1:10, RAM8=rnorm(10)+5)
Master[5,5] <- NA
Select required columns of the dataframe:
Master[colnames(Master) %in% paste0("RAM",1:92)]
Use do.call to run pmax with the selected columns as arguments, adding the argument na.rm=TRUE:
Master$RAM <- do.call(pmax, c(Master[colnames(Master) %in% paste0("RAM",1:92)], na.rm=TRUE))
Sample output:
Master
# Other RAM1 RAM2 RAM3 RAM4 RAM5 RAM6 RAM7 RAM8 RAM
#1 100 1 1 1 1 1 1 1 6.262954 6.262954
#2 100 2 2 2 2 2 2 2 4.673767 4.673767
#3 100 3 3 3 3 3 3 3 6.329799 6.329799
#4 100 4 4 4 4 4 4 4 6.272429 6.272429
#5 100 5 5 5 NA 5 5 5 5.414641 5.414641
#6 100 6 6 6 6 6 6 6 3.460050 6.000000
#7 100 7 7 7 7 7 7 7 4.071433 7.000000
#8 100 8 8 8 8 8 8 8 4.705280 8.000000
#9 100 9 9 9 9 9 9 9 4.994233 9.000000
#10 100 10 10 10 10 10 10 10 7.404653 10.000000
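The same idea can be wrapped in a small helper that also copes with the case where none of the candidate columns exist. This is only a sketch: row_max_existing is a hypothetical name, not a function from any package, and the three-row data frame is made up for illustration.

```r
# Hypothetical helper: row-wise maximum over whichever candidate columns exist
row_max_existing <- function(data, candidates) {
  present <- intersect(candidates, names(data))
  if (length(present) == 0) {
    # Warn instead of stopping, so the surrounding script keeps running
    warning("none of the candidate columns exist; returning NA")
    return(rep(NA_real_, nrow(data)))
  }
  do.call(pmax, c(as.list(data[present]), na.rm = TRUE))
}

# Made-up example data with only RAM1 and RAM2 present
Master <- data.frame(Other = 100, RAM1 = c(1, 5, NA), RAM2 = c(4, 2, 3))
Master$RAM <- row_max_existing(Master, paste0("RAM", 1:92))
Master$RAM  # 4 5 3
```

Using intersect() rather than %in% keeps the columns in the order of the candidate list, which does not matter for pmax but makes the selection easier to inspect.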
I have a product code variable like:
Product Code
RMMI001,
RMMI001,
CMCM009,
ASCMOT064,
ASPMOA023,
CMCM009,
CMCM012,
CMCM001,
ASCMBW001,
RMMI001,
TMHO002,
TMSP001,
TMHO002,
TMDMST003
I need to split these codes and put the letter part in another column.
You may try using sub here to remove all trailing numbers, leaving you with the character portion:
df <- data.frame(product_code=c("RMMI001", "RMMI001", "CMCM009"))
df$code <- sub("\\d*$", "", df$product_code)
df
product_code code
1 RMMI001 RMMI
2 RMMI001 RMMI
3 CMCM009 CMCM
What about something like this?
# Sample product codes
ss <- c("RMMI001", "RMMI001", "CMCM009", "ASCMOT064", "ASPMOA023", "CMCM009", "CMCM012", "CMCM001", "ASCMBW001", "RMMI001", "TMHO002", "TMSP001", "TMHO002", "TMDMST003")
# Separate code and numbers and store in data.frame
read.csv(text = gsub("^([a-zA-Z]+)(\\d+)$", "\\1,\\2", ss), header = FALSE)
# V1 V2
#1 RMMI 1
#2 RMMI 1
#3 CMCM 9
#4 ASCMOT 64
#5 ASPMOA 23
#6 CMCM 9
#7 CMCM 12
#8 CMCM 1
#9 ASCMBW 1
#10 RMMI 1
#11 TMHO 2
#12 TMSP 1
#13 TMHO 2
#14 TMDMST 3
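Note that read.csv parses the second field as a number, so leading zeros are lost (001 becomes 1). If the numeric part should stay as text, one possible base R sketch is to strip each part with sub; the column names code and number are chosen here only for illustration.

```r
ss <- c("RMMI001", "ASCMOT064", "TMDMST003")

parts <- data.frame(
  code   = sub("\\d+$", "", ss),       # drop the trailing digits
  number = sub("^[A-Za-z]+", "", ss),  # drop the leading letters
  stringsAsFactors = FALSE
)
parts
```

Both columns are character vectors, so "001" keeps its zeros; wrap number in as.integer() if a numeric value is wanted instead.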
You can also use tidyr::extract, which works on data frames only.
tidyr::extract(data.frame(x = c("RMMI001", "CMCM009")), x, c("first", "second"), "([a-zA-Z]+)(\\d+)")
Output:
# first second
#1 RMMI 001
#2 CMCM 009
This puts the letters and the digits in separate columns, one column per capturing group (the parts of the regex in parentheses). If you use "([a-zA-Z]+)\\d+" instead of "([a-zA-Z]+)(\\d+)", only the letters are captured, so only one column is returned:
tidyr::extract(data.frame(x = c("RMMI001", "CMCM009")), x, c("first"), "([a-zA-Z]+)\\d+")
# first
# 1 RMMI
# 2 CMCM
I have a dataset containing several variables and I want to test for differences between groups (Kruskal-Wallis test) for each variable separately.
My data (df) look like this: carbon and nitrogen content for different agricultural management types (see name).
I have 16 groups (to simplify, the extract below shows 8):
Extract of the data:
name N_cont C_cont agriculture
C_ero 1,064 8,380 1
C_ero 0,961 8,086 1
C_ero 0,977 8,331 1
Ds_ero 1,767 17,443 2
Ds_ero 1,802 18,264 2
Ds_ero 2,083 20,112 2
Ms_ero 1,547 14,380 3
Ms_ero 1,566 15,313 3
Ms_ero 1,505 14,760 3
Md_ero 1,512 14,303 4
Md_ero 1,656 15,331 4
Md_ero 1,500 13,788 4
C_upsl 1,121 10,581 5
C_upsl 1,159 10,460 5
C_upsl 1,223 10,171 5
Ds_upsl 1,962 20,656 6
Ds_upsl 1,784 16,780 6
Ds_upsl 1,720 17,482 6
Ms_upsl 1,578 16,228 7
Ms_upsl 1,634 15,331 7
Ms_upsl 1,394 13,419 7
Md_upsl 1,286 11,824 8
Md_upsl 1,241 11,452 8
Md_upsl 1,317 11,932 8
I have already converted agriculture to a factor:
df$agriculture <- factor(df$agriculture)
I can run statistical tests comparing all of the groups, e.g.
kruskal.test(df$C_cont, df$agriculture)
But now I would like to run tests on specific groups only, e.g. those whose name contains C (conventional) or DS (direct seeding), or ero (eroding site) or upsl (upper slope).
I tried grep and split, but they did not work because x and y must have the same length.
Do you have any clue?
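You can try to subset with grepl. Assuming you want rows whose name contains either DS, upsl or C (note that grepl is case-sensitive: the sample data spell it Ds, so the DS alternative matches nothing here, and the Ds_upsl rows appear only because they contain upsl), then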
df[grepl("(DS)|(upsl)|(C)", df$name), ]
# name N_cont C_cont agriculture
#1 C_ero 1,064 8,380 1
#2 C_ero 0,961 8,086 1
#3 C_ero 0,977 8,331 1
#13 C_upsl 1,121 10,581 5
#14 C_upsl 1,159 10,460 5
#15 C_upsl 1,223 10,171 5
#16 Ds_upsl 1,962 20,656 6
#17 Ds_upsl 1,784 16,780 6
#18 Ds_upsl 1,720 17,482 6
#19 Ms_upsl 1,578 16,228 7
#20 Ms_upsl 1,634 15,331 7
#21 Ms_upsl 1,394 13,419 7
#22 Md_upsl 1,286 11,824 8
#23 Md_upsl 1,241 11,452 8
#24 Md_upsl 1,317 11,932 8
If you do not want to hard-code the name values, you can also try
x <- c("C", "DS", "upsl")
df[grepl(paste0(x, collapse = "|"), df$name), ]
which yields the same result.
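Once the rows are subset, the test can be run on the subset directly, which avoids the length mismatch between x and y. The sketch below rebuilds a few rows of the sample data with decimal points instead of commas; using factor(name) as the grouping variable is an assumption made here for brevity.

```r
df <- data.frame(
  name   = rep(c("C_ero", "Ds_ero", "C_upsl", "Ds_upsl"), each = 3),
  C_cont = c(8.380, 8.086, 8.331, 17.443, 18.264, 20.112,
             10.581, 10.460, 10.171, 20.656, 16.780, 17.482)
)
df$agriculture <- factor(df$name)  # assumption: one group per name

# Keep only the conventional plots; ^ anchors the match at the start of name
sub_df <- df[grepl("^C_", df$name), ]

# Drop factor levels that no longer occur in the subset
sub_df$agriculture <- droplevels(sub_df$agriculture)

# x and y now come from the same rows, so their lengths agree
res <- kruskal.test(C_cont ~ agriculture, data = sub_df)
res$p.value
```

If the values were read from a file with decimal commas, they would first need to be read as numbers (for example with dec = "," in read.table) for the test to run.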
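Another option is the data.table package:
library(data.table)
Create a subset of the group you want to run your stats on. If your data frame is df, then
DT <- data.table(df)
DT[like(name, "C_")]
Or use the sqldf package. In SQL's LIKE, _ matches any single character and % matches any sequence of characters, so the underscore is escaped here:
library(sqldf)
sqldf("select * from df where name like 'C!_%' escape '!'")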
I have two columns of values like this:
>bb
GDis BDis
1 12.291488 8.009909
2 11.283319 13.625103
3 6.674549 8.629232
4 13.493121 17.175888
5 9.550731 9.867878
6 9.193895 9.785301
7 10.541702 10.941371
8 9.849527 9.496284
9 8.682287 8.133774
10 8.439381 4.335260
I need to add an extra column called Index that holds the ratio GDis/BDis if GDis is bigger, and BDis/GDis if BDis is bigger.
How do I do that?
You can use pmax and pmin.
transform(bb, Index = pmax(GDis, BDis) / pmin(GDis, BDis))
You can also use arithmetic:
transform(bb, Index = (GDis / BDis) ^ (1 - 2 * (BDis > GDis)))
The result:
GDis BDis Index
1 12.291488 8.009909 1.534535
2 11.283319 13.625103 1.207544
3 6.674549 8.629232 1.292856
4 13.493121 17.175888 1.272937
5 9.550731 9.867878 1.033207
6 9.193895 9.785301 1.064326
7 10.541702 10.941371 1.037913
8 9.849527 9.496284 1.037198
9 8.682287 8.133774 1.067436
10 8.439381 4.335260 1.946684
Try
transform(bb, Index=ifelse(GDis>BDis, GDis/BDis, BDis/GDis))
# GDis BDis Index
#1 12.291488 8.009909 1.534535
#2 11.283319 13.625103 1.207544
#3 6.674549 8.629232 1.292856
#4 13.493121 17.175888 1.272937
#5 9.550731 9.867878 1.033207
#6 9.193895 9.785301 1.064326
#7 10.541702 10.941371 1.037913
#8 9.849527 9.496284 1.037198
#9 8.682287 8.133774 1.067436
#10 8.439381 4.335260 1.946684
Not as nice as the other answers, but how about this?
bb$RATIO <- ifelse(bb$GDis > bb$BDis, bb$GDis / bb$BDis, bb$BDis / bb$GDis)
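The approaches in this thread (pmax/pmin, the power trick, and ifelse) should give the same numbers. A quick way to convince yourself is to compute all three on a few rows and compare them; the sketch below uses the first three rows of bb, and the variable names are only for illustration.

```r
bb <- data.frame(GDis = c(12.291488, 11.283319, 6.674549),
                 BDis = c(8.009909, 13.625103, 8.629232))

via_pmax   <- pmax(bb$GDis, bb$BDis) / pmin(bb$GDis, bb$BDis)
via_power  <- (bb$GDis / bb$BDis)^(1 - 2 * (bb$BDis > bb$GDis))
via_ifelse <- ifelse(bb$GDis > bb$BDis, bb$GDis / bb$BDis, bb$BDis / bb$GDis)

# all.equal compares with a small numeric tolerance
all.equal(via_pmax, via_power)
all.equal(via_pmax, via_ifelse)
```

Because the larger value is always in the numerator, every entry is at least 1.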