I have a small helper function that converts the characters in a specified column to numbers (revalue() comes from the plyr package), as follows:
library(plyr)
ccreate <- function(df, x){ revalue(df[[x]], c( "?"=1 , "D"=2 , "C"=3 , "B"=4 , "A"=5 )) }
I then use that function inside a second function that appends the converted values as a new column to my original dataset:
coladd <- function(df, x){ df[[paste(x, "_col", sep='' )]] <- ccreate(df,x); df }
Here is an example of the functions in use:
col1 <- c("A", "B", "C", "D", "?")
col2 <- c("A", "A", "A", "D", "?")
col3 <- c("C", "B", "?", "A", "B")
test <- data.frame(col1, col2, col3)
test
coladd(test, "col1")
This works, but I have to feed each column name from my dataset into coladd() one at a time. Is there a way to apply the coladd() function to every column in a dataframe without having to type in each column name?
Thanks again, and sorry for any confusion; this is my first post here.
Keeping your functions as they are, you can use Reduce():
ccreate <- function(df, x){ revalue(df[[x]], c( "?"=1 , "D"=2 , "C"=3 , "B"=4 , "A"=5 )) }
coladd <- function(df, x){ df[[paste(x, "_col", sep='' )]] <- ccreate(df,x); df }
Reduce(coladd, names(test), test)
# col1 col2 col3 col1_col col2_col col3_col
# 1 A A C 5 5 3
# 2 B A B 4 5 4
# 3 C A ? 3 5 1
# 4 D D A 2 2 5
# 5 ? ? B 1 1 4
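Reduce() threads the accumulating data frame through successive coladd() calls, i.e. it evaluates coladd(coladd(coladd(test, "col1"), "col2"), "col3"). If that is easier to follow as a loop, this plain for loop is equivalent (a sketch):
out <- test
for (nm in names(test)) out <- coladd(out, nm)  # adds one _col column per original column
out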
Here is how I would do it, though not using your functions.
library(dplyr)
# this is a named vector to serve as your lookup
recode_val <- c( "?"=1 , "D"=2 , "C"=3 , "B"=4 , "A"=5 )
test %>%
  mutate(across(everything(), list(col = ~ recode_val[.])))
# col1 col2 col3 col1_col col2_col col3_col
# 1 A A C 5 5 3
# 2 B A B 4 5 4
# 3 C A ? 3 5 1
# 4 D D A 2 2 5
# 5 ? ? B 1 1 4
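The same named-vector lookup also works in base R with no helper functions (a sketch reusing the recode_val vector above; as.character() guards against factor columns):
test[paste0(names(test), "_col")] <- lapply(test, function(x) unname(recode_val[as.character(x)]))
test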
It's hard to explain, so I'll start with an example. I have some numeric columns (A, B, C). The column 'tmp' contains variable names of the numeric columns as concatenated strings:
set.seed(100)
A <- floor(runif(5, min=0, max=10))
B <- floor(runif(5, min=0, max=10))
C <- floor(runif(5, min=0, max=10))
tmp <- c('A','B,C','C','A,B','A,B,C')
df <- data.frame(A,B,C,tmp)
A B C tmp
1 3 4 6 A
2 2 8 8 B,C
3 5 3 2 C
4 0 5 3 A,B
5 4 1 7 A,B,C
Now, for each row, I want to use the variable names in tmp to select the values from the corresponding numeric columns with the same name(s). Then I want to keep only the rows where all the selected values are less than or equal to 3.
E.g. in the first row, tmp is A, and the corresponding value in column A is 3, so the row is kept.
Another example: in row 4, tmp is A,B. The corresponding values are A = 0 and B = 5; since B exceeds 3, not all selected values are less than or equal to 3, and this row is discarded.
Desired result:
A B C tmp
1 3 4 6 A
2 5 3 2 C
How can I perform such filtering?
This is a bit more complicated than I like and there might be a more elegant solution, but here we go:
#split tmp
col <- strsplit(df[["tmp"]], ",")
#create an index matrix
inds <- do.call(rbind, Map(data.frame, row = seq_along(col), col = col))
inds$col <- match(inds$col, names(df))
inds <- as.matrix(inds)
#check
chk <- m <- as.matrix(df[, names(df) != "tmp"])
mode(chk) <- "logical"
chk[] <- NA
chk[inds] <- m[inds] <= 3
sel <- apply(chk, 1, prod, na.rm = TRUE)
df[as.logical(sel),]
# A B C tmp
#1 3 4 6 A
#3 5 3 2 C
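For comparison, a more compact rowwise sketch of the same check, splitting tmp and testing each row's selected columns directly:
keep <- mapply(function(cols, i) all(df[i, cols] <= 3),
               strsplit(df$tmp, ","), seq_len(nrow(df)))
df[keep, ]
# A B C tmp
#1 3 4 6 A
#3 5 3 2 C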
Not sure if it always works (and it probably isn't the best solution), but it worked here:
library(dplyr)
library(tidyr)
library(stringr)
List <- vector("list", nrow(df))
for (i in seq_len(nrow(df))) {  # loop over rows (1:length(df) would loop over columns)
  tmpT <- as.vector(str_split(df$tmp[i], ",", simplify = TRUE))
  selec <- df %>%
    select(all_of(tmpT)) %>%
    slice(i) %>%
    filter_all(all_vars(. <= 3)) %>%
    unite(val, sep = ", ")
  # keep the row only if it survived the filter; NULL entries are dropped by rbind
  List[[i]] <- if (nrow(selec) == 0) NULL else df[i, ]
}
df2 <- do.call("rbind", List)
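With those fixes the result matches the desired output:
df2
# A B C tmp
#1 3 4 6 A
#3 5 3 2 C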
This answer has some similarities with #Roland's, but here we work with the data in a 'longer' format:
# create row index
df$ri = seq_len(nrow(df))
# split the concatenated column
l <- strsplit(df$tmp, ',')
# repeat each row of the data with the lengths of the split string,
# bind with individual strings
d = cbind(df[rep(1:nrow(df), lengths(l)), ], x = unlist(l))
# use match to grab values from corresponding columns
d$val <- d[cbind(seq(nrow(d)), match(d$x, names(d)))]
# for each original row 'ri', check if all values are <= 3. use result to index data frame
d[as.logical(ave(d$val, d$ri, FUN = function(x) all(x <= 3))), ]
# A B C tmp ri x val
# 1 3 4 6 A 1 A 3
# 3 5 3 2 C 3 C 2
I have two data frames like this:
df1 <- data.frame("A" = c(1,2,3,4,5), "B" = c(10,20,30,40,50), "C" = c(6,7,8,9,11))
df2 <- data.frame("A" = c(10,4,30,20,50), "B" = c(1,40,3,7,5), "C" = c(12,13,14,15,15))
I want to find out whether any row in df1 equals a row in df2 in columns A and B. You can see that df1[4, 1:2] == df2[2, 1:2].
I tried
for (i in 1:5){
  if (for (j in 1:5){
    df[i,1:2] == df2[j,1:2]})
    print("true")
}
But it gives me this error: Error in if (for (j in 1:5) { : argument is of length zero
The error occurs because a for loop returns NULL invisibly, so if() receives a zero-length condition. Instead of looping, you can row-wise paste the values and check for duplicates using %in% :
df1[do.call(paste, df1[1:2]) %in% do.call(paste, df2[1:2]),]
# A B C
#4 4 40 9
If you need only a single TRUE/FALSE value:
any(do.call(paste, df1[1:2]) %in% do.call(paste, df2[1:2]))
#[1] TRUE
If you want to remove the rows of df1 that are present in df2, you can use anti_join from dplyr:
dplyr::anti_join(df1, df2, by = c('A', 'B'))
# A B C
#1 1 10 6
#2 2 20 7
#3 3 30 8
#4 5 50 11
To get the common rows you can use semi_join (or inner_join):
dplyr::semi_join(df1, df2, by = c('A', 'B'))
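This returns the matching row:
# A  B C
#1 4 40 9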
You can row bind columns A and B and use anyDuplicated():
anyDuplicated(rbind(df1[1:2], df2[1:2])) > 0
[1] TRUE
If there are potential duplicates within the data frames, you'll need to make them unique first:
anyDuplicated(rbind(unique(df1[1:2]), unique(df2[1:2]))) > 0
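Without the unique() calls, a duplicate inside a single data frame would already trigger TRUE. For example, repeating df1's first row (a quick sketch):
anyDuplicated(rbind(df1[c(1, 1), 1:2], df2[1:2])) > 0
#[1] TRUE
even though df1's first row never appears in df2.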
Here is a solution returning which rows of df1 and df2 match on columns A and B; res is an nrow(df1)-by-nrow(df2) logical matrix.
res <- apply(df2[1:2], 1, function(y) {
  apply(df1[1:2], 1, function(x) all(x == y))
})
which(res, arr.ind = TRUE)
# row col
#[1,] 4 2
w <- which(res, arr.ind = TRUE)
colnames(w) <- c('df1', 'df2')
w
# df1 df2
#[1,] 4 2
An option is a data.table anti-join, which keeps the rows of df1 that have no match in df2:
library(data.table)
setDT(df1)[!df2, on = .(A, B)]
# A B C
#1: 1 10 6
#2: 2 20 7
#3: 3 30 8
#4: 5 50 11
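For the matching rows instead, a semi-join sketch (df1 is already a data.table after setDT above; which = TRUE returns the indices of the df1 rows that df2 matches):
df1[df1[df2, on = .(A, B), nomatch = 0, which = TRUE]]
#   A  B C
#1: 4 40 9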
data
df1 <- data.frame("A" = c(1,2,3,4,5), "B" = c(10,20,30,40,50), "C" = c(6,7,8,9,11))
df2 <- data.frame("A" = c(10,4,30,20,50), "B" = c(1,40,3,7,5), "C" = c(12,13,14,15, 15))
Suppose I have a data frame (let's call it DF) that looks like this:
options(stringsAsFactors = F)
letters <- c("A", "B", "C", "D", "E")
value <- c(.44, .54, .21, .102, .002)
test <- c("2", "c(1,4)", "1", "3:4", "c(1,2)")
DF <- data.frame(cbind(letters, value, test))
DF$value <- as.numeric(DF$value)
This is what DF looks like if you were to print it:
#DF
# letters value test
#1 A 0.440 2
#2 B 0.540 c(1,4)
#3 C 0.210 1
#4 D 0.102 3:4
#5 E 0.002 c(1,2)
My main issue is DF$test. For any cell that holds more than one value (i.e. 3:4 or c(1,2)), I would like the cell to have the format X:Y, where X and Y are numeric values.
Can someone help? Please note that DF$test is a character vector.
One gsub option uses two nested gsub calls:
DF$test2 <- gsub(",",":", gsub(".*c\\((.*)\\).*", "\\1", DF$test))
DF
# letters value test test2
#1 A 0.440 2 2
#2 B 0.540 c(1,4) 1:4
#3 C 0.210 1 1
#4 D 0.102 3:4 3:4
#5 E 0.002 c(1,2) 1:2
The inner gsub extracts everything between the c( and ), and the outer gsub then replaces each , with :. This also works if you have more than two numbers in your c(); i.e. c(1,2,3) would become 1:2:3.
Another way, with strsplit:
tst <- gsub('[c()]','',DF$test)
tst <- strsplit(tst, '[,:]')
DF$test <- sapply(tst, paste0, collapse = ':')
or in one go:
DF$test <- sapply(strsplit(gsub('[c()]','',DF$test), '[,:]'), paste0, collapse = ':')
your data.frame now looks like:
> DF
letters value test
1 A 0.440 2
2 B 0.540 1:4
3 C 0.210 1
4 D 0.102 3:4
5 E 0.002 1:2
The advantage of this is that it also works with strings in DF$test that contain more than two numbers.
A single gsub can also get you there, assuming each cell contains at most two numbers:
DF$test <- gsub(".+(\\d+).(\\d+).+", "\\1:\\2", DF$test)
We can use str_extract_all from stringr:
library(stringr)
DF$test <- sapply(str_extract_all(DF$test, '[0-9]+'), paste, collapse=":")
DF$test
#[1] "2" "1:4" "1" "3:4" "1:2"
Or using base R
DF$test <- sapply(regmatches(DF$test, gregexpr('[0-9]+', DF$test)), paste, collapse=":")
I have items in different lists and I want to count the items in each list and output the counts to a table. However, I run into difficulty when the lists contain different items. To illustrate my problem:
item_1 <- c("A","A","B")
item_2 <- c("A","B","B","B","C")
item_3 <- c("C","A")
item_4 <- c("D","A", "A")
item_5 <- c("B","D")
list_1 <- list(item_1, item_2, item_3)
list_2 <- list(item_4, item_5)
table_1 <- table(unlist(list_1))
table_2 <- table(unlist(list_2))
> table_1
A B C
4 4 2
> table_2
A B D
2 1 2
What I get from cbind is:
> cbind(table_1, table_2)
table_1 table_2
A 4 2
B 4 1
C 2 2
which is clearly wrong. What I need is:
table_1 table_2
A 4 2
B 4 1
C 2 0
D 0 2
Thanks in advance
It would probably be better to use factors from the start if possible, since table() on a factor reports every level, even those with zero counts. Something like:
L <- list(list_1 = list_1,
list_2 = list_2)
RN <- unique(unlist(L))
do.call(cbind,
        lapply(L, function(x)
          table(factor(unlist(x), RN))))
# list_1 list_2
# A 4 2
# B 4 1
# C 2 0
# D 0 2
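Since each table here has the same length and names, sapply simplifies to the same matrix a little more directly:
sapply(L, function(x) table(factor(unlist(x), RN)))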
However, going with what you have, a function like the following might be useful. I've added comments to help explain what's happening in each step.
myFun <- function(..., fill = 0) {
  ## Get the names of the ...s. These will be our column names
  CN <- sapply(substitute(list(...))[-1], deparse)
  ## Put the ...s into a list
  Lst <- setNames(list(...), CN)
  ## Get the relevant row names
  RN <- unique(unlist(lapply(Lst, names), use.names = FALSE))
  ## Create an empty matrix. `fill` can be anything--it's set to 0
  M <- matrix(fill, length(RN), length(CN),
              dimnames = list(RN, CN))
  ## Use match to identify the correct row to fill in
  Row <- lapply(Lst, function(x) match(names(x), RN))
  ## Use matrix indexing to fill in the unlisted values of Lst
  M[cbind(unlist(Row),
          rep(seq_along(Lst), vapply(Row, length, 1L)))] <-
    unlist(Lst, use.names = FALSE)
  ## Return the matrix
  M
}
Applied to your two tables, the outcome is like this:
myFun(table_1, table_2)
# table_1 table_2
# A 4 2
# B 4 1
# C 2 0
# D 0 2
Here's an example that adds a third table to the problem. It also demonstrates the use of NA as a fill value.
set.seed(1) ## So you can get the same results as me
table_3 <- table(sample(LETTERS[3:6], 20, TRUE))
table_3
#
# C D E F
# 2 7 9 2
myFun(table_1, table_2, table_3, fill = NA)
# table_1 table_2 table_3
# A 4 2 NA
# B 4 1 NA
# C 2 NA 2
# D NA 2 7
# E NA NA 9
# F NA NA 2
To fix your existing problem, you can put the two tables into a list and add the missing values and names back in. Here, nm is a vector of every distinct item name across the tables, tbs is a list of the tables, and we can use sapply to append and reorder the missing values.
> nm <- unique(unlist(mget(paste("item", 1:5, sep = "_"))))
> tbs <- list(t1 = table_1, t2 = table_2)
> sapply(tbs, function(x) {
    x[4] <- 0L
    names(x)[4] <- nm[!nm %in% names(x)]
    x[nm]
  })
t1 t2
A 4 2
B 4 1
C 2 0
D 0 2
A more general solution, which works when you don't know which names are missing and lets you keep NA values, is
> sapply(tbs, function(x) {
    length(x) <- length(nm)
    x <- x[match(nm, names(x))]
    setNames(x, nm)
  })
t1 t2
A 4 2
B 4 1
C 2 NA
D NA 2
But you could have gone straight from the items to the tables; you put the items into a list only to unlist them in the very next step. Note that table's useNA argument only controls whether a count of NA values is reported (useNA = "always" shows it even when it is zero); it does not add zero counts for absent items:
> t1 <- table(c(item_1, item_2, item_3), useNA = "always")
> t2 <- table(c(item_4, item_5), useNA = "always")
> t2
A B D <NA>
2 1 2 0
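To actually get zero counts for the missing items, tabulate against a shared set of factor levels instead (a sketch along the lines of the first answer):
lvls <- sort(unique(c(item_1, item_2, item_3, item_4, item_5)))
t1 <- table(factor(c(item_1, item_2, item_3), levels = lvls))
t2 <- table(factor(c(item_4, item_5), levels = lvls))
cbind(t1, t2)
#  t1 t2
#A  4  2
#B  4  1
#C  2  0
#D  0  2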
A quick fix to your problem is to make the tables into data frames and then merge them:
d1 <- data.frame(value=names(table_1), table_1=as.numeric(table_1))
d2 <- data.frame(value=names(table_2), table_2=as.numeric(table_2))
merge(d1,d2, all=TRUE)
This will create NAs where you might want 0s. That can be fixed with:
M <- merge(d1,d2, all=TRUE)
M[is.na(M)] <- 0
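M now matches the desired table:
M
#  value table_1 table_2
#1     A       4       2
#2     B       4       1
#3     C       2       0
#4     D       0       2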