I have a vector. On the one hand, I want to fix factor values that seem to be misclassified, for instance the "D" at position 7: as the surrounding values are "A", it should be "A" too. I imagine there must be a rule, for example: if the 3 values before and after an outlier differ from it, the outlier is converted to their value (in this case "D" to "A"); otherwise it is removed, like the "C" at position 22.
Var = c("A", "A", "A", "A","A", "A", "D", "A", "A","A", "A", "B", "B", "B", "B", "B", "B", "B", "B", "B", "B", "C", "B", "B", "C","C","C","C","C","C","C","C","C","C","D", "D","D","D","D","D","D","D", "A", "A", "A", "A","A", "A", "A", "A", "A", "A","A", "A", "B", "B", "B", "B", "B", "B", "B", "B", "B", "B", "B", "B", "C","C","C","C","C","C","C","C", "C","C","C","C","C","C","C","C", "D","D","D","D","D")
Var= as.factor(Var)
Var2=c("1", "2", "1", "2","3", "2", "1", "2", "2","2", "2", "1", "1",
"1", "2", "1", "2","3", "2", "1", "2", "2","2", "2",
"1", "2", "1", "2","3", "2", "1", "2", "2","2", "2", "1", "1", "2","2", "2", "1", "1","2","2", "2", "1", "1", "2","2", "2", "1", "1", "2","2", "2", "1", "1","2","2", "2", "1", "1", "2","2", "2", "1", "2","2", "2", "1", "1", "2","2", "2", "1", "1", "2","2", "2", "1", "1","1",
"1","1","1","1","1")
df<- data.frame (Var, Var2)
Additionally, I want to count the occurrences of each value every time it occurs. So I do not want to count the occurrences in the whole vector, but to get a list like this, ideally with the corrected values:
# Var Occurence
#1 A 6
#2 D 1
#3 A 4
#4 B 10
#5 C 1
#6 B 2 ...
So far I can only count the values across the whole vector, with
table (Var)
With the following code I get a column that restarts counting each time "Var" changes:
# Store the running count in a new column so that the factor 'Var' is not overwritten
df$count <- sequence(rle(as.character(df$Var))$lengths)
This may be easier with data.table. Group by the run-length id (rleid) of 'Var' to get the count (.N) of each run, then drop the runs whose count is an outlier by building a logical expression in i from the boxplot outliers:
library(data.table)
setDT(df)[, .N, .(Var, grp = rleid(Var))][, grp := NULL][
!N %in% boxplot(N, plot = FALSE)$out]
Output:
Var N
1: A 6
2: D 1
3: A 4
4: B 10
5: C 1
6: B 2
7: C 10
8: D 8
9: A 12
10: B 12
11: C 16
12: D 5
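For comparison, base R's rle() gives the same kind of per-run counts without data.table. This is a minimal sketch on a shortened, made-up vector (x is not the full 'Var' from the question):

```r
# Shortened example vector (hypothetical, not the full 'Var')
x <- c("A", "A", "A", "D", "A", "A", "B", "B", "C", "B")

# rle() needs an atomic vector, so a factor is converted with as.character() first
runs <- rle(as.character(x))

# One row per run, in order of appearance
run_counts <- data.frame(Var = runs$values, Occurrence = runs$lengths)
run_counts
```

Each row is one uninterrupted run, so "A" appears twice here: once for the run before the "D" and once for the run after it.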
rleid can take multiple input columns, as its first argument is variadic (...). From ?rleid:
rleid(..., prefix=NULL)
... A sequence of numeric, integer64, character or logical vectors, all of same length. For interactive use.
Therefore, if there are multiple columns, either pass the columns to rleid individually, or use rleidv with the relevant subset of the data.frame/data.table as input:
setDT(df)[, .N, .(Var, Var2, grp = rleid(Var, Var2))][,
grp := NULL][ !N %in% boxplot(N, plot = FALSE)$out]
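To illustrate the rleidv alternative mentioned above, here is a small sketch on a made-up table (dt and its values are assumptions for illustration only). Both calls should produce the same run ids, since a new run starts whenever either column changes:

```r
library(data.table)

# Toy data (hypothetical): a new run starts when Var or Var2 changes
dt <- data.table(Var  = c("A", "A", "A", "B", "B"),
                 Var2 = c("1", "1", "2", "2", "2"))

# Columns passed individually through the variadic '...'
id_a <- rleid(dt$Var, dt$Var2)

# The whole table passed once, with the columns chosen via 'cols'
id_b <- rleidv(dt, cols = c("Var", "Var2"))

id_a
id_b
```

rleidv is convenient when the column names are held in a character vector rather than typed out.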
I am importing the following table 1 into R but am struggling with the formatting, as each column has two headers. My desired output is the second table 2. I plan to use tidyr to gather the data.
Another obstacle is the merged cells. I have been using fillMergedCells = TRUE to duplicate their values:
read.xlsx(xlsxFile ="C:/Users/X/X/Desktop/X.xlsx",fillMergedCells = TRUE)
One option would be to:
1. Read your Excel file with the option colNames = FALSE.
2. Paste the first two rows together and use the result as the column names. Here I use an underscore as the separator, which makes it easy to split the names later on.
3. Get rid of the first two rows.
4. Use tidyr::pivot_longer to convert to long format.
# df <- openxlsx::read.xlsx(xlsxFile ="data/test2.xlsx", fillMergedCells = TRUE, colNames = FALSE)
# Use first two rows as names
names(df) <- paste(df[1, ], df[2, ], sep = "_")
names(df)[1] <- "category"
# Get rid of first two rows and columns containing year average
df <- df[-c(1:2), ]
df <- df[, !grepl("^Year", names(df))]
library(tidyr)
library(dplyr)
df %>%
pivot_longer(-category, names_to = c("Time", ".value"), names_pattern = "^(.*?)_(.*)$") %>%
arrange(Time)
#> # A tibble: 16 × 4
#> category Time Y Z
#> <chr> <chr> <chr> <chr>
#> 1 Total Feb-21 1 1
#> 2 A Feb-21 2 2
#> 3 B Feb-21 3 3
#> 4 C Feb-21 4 4
#> 5 D Feb-21 5 5
#> 6 E Feb-21 6 6
#> 7 F Feb-21 7 7
#> 8 G Feb-21 8 8
#> 9 Total Jan-21 1 1
#> 10 A Jan-21 2 2
#> 11 B Jan-21 3 3
#> 12 C Jan-21 4 4
#> 13 D Jan-21 5 5
#> 14 E Jan-21 6 6
#> 15 F Jan-21 7 7
#> 16 G Jan-21 8 8
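To see how names_pattern splits the pasted column names, the same regular expression can be tried with base R's sub(). This is only an illustration on made-up names; pivot_longer does the matching itself:

```r
# Column names as produced by pasting the two header rows with "_"
nms <- c("Jan-21_Y", "Jan-21_Z", "Feb-21_Y", "Feb-21_Z")

# Same pattern as in names_pattern: lazy first group, then "_", then the rest
pattern <- "^(.*?)_(.*)$"

time_part  <- sub(pattern, "\\1", nms, perl = TRUE)  # first capture group
value_part <- sub(pattern, "\\2", nms, perl = TRUE)  # second capture group

data.frame(name = nms, Time = time_part, value_col = value_part)
```

The first group ends up in the Time column, while the second group, matched to ".value" in names_to, supplies the names of the output columns (Y and Z).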
DATA
df <- structure(list(X1 = c(
NA, NA, "Total", "A", "B", "C", "D", "E",
"F", "G"
), X2 = c(
"Year Rolling Avg.", "Share", NA, "1", "1",
"1", "1", "1", "1", "1"
), X3 = c(
"Year Rolling Avg.", "Y", "1",
"2", "3", "4", "5", "6", "7", "8"
), X4 = c(
"Year Rolling Avg.",
"Z", "1", "2", "3", "4", "5", "6", "7", "8"
), X5 = c(
"Jan-21",
"Y", "1", "2", "3", "4", "5", "6", "7", "8"
), X6 = c(
"Jan-21",
"Z", "1", "2", "3", "4", "5", "6", "7", "8"
), X7 = c(
"Feb-21",
"Y", "1", "2", "3", "4", "5", "6", "7", "8"
), X8 = c(
"Feb-21",
"Z", "1", "2", "3", "4", "5", "6", "7", "8"
)), row.names = c(
NA,
10L
), class = "data.frame")
Data frame = qog_std3
factor = btid
I am trying to collapse this ordinal factor using the following code:
I get the following error message:
Error: unexpected '=' in:
"btid4 <- fct_collapse(qog_std3$btid,
1="
Can anyone explain to me why the use of "=" provides this error and what I can do about it?
Any alternative solution would also be deeply appreciated.
Whether the column is a factor or character, the new level name needs to be quoted when it is numeric, because a bare number is not a valid argument name in R. It is not an issue when the name is non-numeric.
fct_collapse(df1$btid, "1" = c("1", "2"))
#[1] 1 1 3 3 4 5 1 1
#Levels: 1 3 4 5
Backquotes work as well:
fct_collapse(df1$btid, `1` = c("1", "2"))
#[1] 1 1 3 3 4 5 1 1
#Levels: 1 3 4 5
whereas specifying the numeric value unquoted gives the error:
fct_collapse(df1$btid, 1 = c("1", "2"))
Error: unexpected '=' in " fct_collapse(df1$btid, 1 ="
However, this is not an issue when the name is non-numeric:
fct_collapse(df1$id2, AB = c("A", "B"))
#[1] AB AB C D AB AB C AB
#Levels: AB C D
data
df1 <- structure(list(btid = c("1", "1", "3", "3", "4", "5", "1", "2"
), id2 = c("A", "B", "C", "D", "A", "B", "C", "A")), row.names = c(NA,
-8L), class = "data.frame")
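Since an alternative was also asked for: in base R, levels can be collapsed by assigning a named list to levels(). This is a sketch on the same values as df1$btid; the object names are made up here, and the quoting rule is the same as with fct_collapse:

```r
btid <- factor(c("1", "1", "3", "3", "4", "5", "1", "2"))

# "1" is not a syntactic name; make.names() shows what R would turn it into
syntactic <- make.names("1")
syntactic

# Named list: new level = old levels. Levels left out of the list would
# become NA, so every level to keep is listed explicitly.
collapsed <- btid
levels(collapsed) <- list("1" = c("1", "2"), "3" = "3", "4" = "4", "5" = "5")
collapsed
```

Unlike fct_collapse, this approach does not keep unmentioned levels automatically, which is why "3", "4" and "5" are spelled out.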
I have a data frame arranged as follows:
df <- structure(list(name1= c("A","A","B"),
name2 = c("B", "C","C"),
size = c(10,20,30)),.Names=c("name1","name2","size"),
row.names = c("1", "2", "3"), class =("data.frame"))
I would like to add "mirror" observations as follows:
df <- structure(list(name1 = c("A","B","A", "C", "B", "C"),
name2 = c("B", "A","C", "A", "C", "B"),
size = c(10,10,20,20,30,30)),.Names=c("name1","name2","size"),
row.names = c("1", "2", "3", "4", "5", "6"), class =("data.frame"))
Inputs would be much appreciated.
We can do this in two steps: repeat each row twice, then swap the first two columns in every second row.
df1 <- df[rep(rownames(df), each = 2),]
df1[c(FALSE, TRUE), 1:2] <- df1[c(FALSE, TRUE), 2:1]
df1
# name1 name2 size
#1 A B 10
#1.1 B A 10
#2 A C 20
#2.1 C A 20
#3 B C 30
#3.1 C B 30
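Alternatively, with data.table, we can bind the original to a copy with the first two columns swapped (binding by position, not by name):
library(data.table)
rbindlist(list(df, df[c(2:1, 3)]), use.names = FALSE)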
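The same idea can be sketched in base R, with the mirrored rows interleaved so that the result matches the order of the desired output. The names mirrored and both are made up for this example:

```r
df <- data.frame(name1 = c("A", "A", "B"),
                 name2 = c("B", "C", "C"),
                 size  = c(10, 20, 30),
                 stringsAsFactors = FALSE)

# Swap the first two columns, then restore the original column names
mirrored <- setNames(df[c("name2", "name1", "size")], names(df))

# Stack both, then interleave: order() keeps ties in their original order,
# so each original row is followed directly by its mirror
both <- rbind(df, mirrored)
both <- both[order(rep(seq_len(nrow(df)), times = 2)), ]
rownames(both) <- NULL
both
```

Resetting the column names before rbind() matters, because rbind() on data frames matches columns by name.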
I am asking a side-question about the method I learned here from #redmode :
Subsetting based on values of a different data frame in R
When I try to dynamically adjust the level I want to subset by:
N <- nrow(A)
cond <- sapply(3:N, function(i) sum(A[i,] > 0.95*B[i,])==2)
rbind(A[1:2,], subset(A[3:N,], cond))
I get an error
Error in FUN(left, right) : non-numeric argument to binary operator.
Can you think of a way I can get rows pertaining to values in A that are greater than 95% of the value in B? Thank you.
Here is code for A and B to play with.
A <- structure(list(name1 = c("trt", "0", "1", "10", "1", "1", "10"
), name2 = c("ctrl", "3", "1", "1", "1", "1", "10")), .Names = c("name1",
"name2"), row.names = c("cond", "hour", "A", "B", "C", "D", "E"
), class = "data.frame")
B <- structure(list(name1 = c("trt", "0", "1", "1", "1", "1", "9.4"),
name2 = c("ctrl", "3", "1", "10", "1", "1", "9.4")), .Names = c("name1",
"name2"), row.names = c("cond", "hour", "A", "B", "C", "D", "E"
), class = "data.frame")
You have some serious formatting issues with your data.
First, each column should hold a single data type, and rows should be observations (not always true, but a very good way to start). Here you have a row called cond, then a row called hour, then what I'm guessing is a series of classifications. The way your data is presented to begin with doesn't make much sense and doesn't lend itself to easy manipulation. But all is not lost. This is what I would do:
Reorganize my data:
C <- data.frame(matrix(as.numeric(unlist(A)), ncol=2)[-(1:2), ])
colnames(C) <- c('A.trt', 'A.cntr')
rownames(C) <- LETTERS[1:nrow(C)]
D <- data.frame(matrix(as.numeric(unlist(B)), ncol=2)[-(1:2), ])
colnames(D) <- c('B.trt', 'B.cntr')
(df <- cbind(C, D))
Which gives:
# A.trt A.cntr B.trt B.cntr
# A 1 1 1.0 1.0
# B 10 1 1.0 10.0
# C 1 1 1.0 1.0
# D 1 1 1.0 1.0
# E 10 10 9.4 9.4
Then your problem is easily solved by:
df[which(df[, 1] > 0.95*df[, 3] & df[, 2] > 0.95*df[, 4]), ]
# A.trt A.cntr B.trt B.cntr
# A 1 1 1.0 1.0
# C 1 1 1.0 1.0
# D 1 1 1.0 1.0
# E 10 10 9.4 9.4
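The original error arises because A and B hold character columns, and arithmetic such as 0.95 * B[i, ] is not defined for strings. Once the values are numeric, the filter can also be written without column indices. This is a sketch assuming the numeric part is already split into two data frames of the same shape (A_num and B_num are names made up here):

```r
# Numeric part of A and B from the question, without the cond/hour rows
A_num <- data.frame(trt = c(1, 10, 1, 1, 10), ctrl = c(1, 1, 1, 1, 10),
                    row.names = LETTERS[1:5])
B_num <- data.frame(trt = c(1, 1, 1, 1, 9.4), ctrl = c(1, 10, 1, 1, 9.4),
                    row.names = LETTERS[1:5])

# Element-wise comparison gives a logical matrix; keep rows where
# every column of A exceeds 95% of the matching value in B
keep <- rowSums(A_num > 0.95 * B_num) == ncol(A_num)
kept_rows <- A_num[keep, ]
kept_rows
```

Row B is dropped because its ctrl value (1) is not above 95% of 10, which agrees with the output above.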