Select values when column names are stored as concatenated strings - r

It's hard to explain, so I'll start with an example. I have some numeric columns (A, B, C). The column 'tmp' contains variable names of the numeric columns as concatenated strings:
set.seed(100)
A <- floor(runif(5, min=0, max=10))
B <- floor(runif(5, min=0, max=10))
C <- floor(runif(5, min=0, max=10))
tmp <- c('A','B,C','C','A,B','A,B,C')
df <- data.frame(A,B,C,tmp)
A B C tmp
1 3 4 6 A
2 2 8 8 B,C
3 5 3 2 C
4 0 5 3 A,B
5 4 1 7 A,B,C
Now, for each row, I want to use the variable names in tmp to select the values from the corresponding numeric columns with the same name(s). Then I want to keep only the rows where all the selected values are less than or equal to 3.
E.g. in the first row, tmp is A, and the corresponding value in column A is 3, so this row is kept.
Another example: in row 4, tmp is A,B. The corresponding values are A = 0 and B = 5. Thus, not all selected values are less than or equal to 3, and this row is discarded.
Desired result:
A B C tmp
1 3 4 6 A
2 5 3 2 C
How can I perform such filtering?

This is a bit more complicated than I like and there might be a more elegant solution, but here we go:
#split tmp
col <- strsplit(df[["tmp"]], ",")
#create an index matrix
inds <- do.call(rbind, Map(data.frame, row = seq_along(col), col = col))
inds$col <- match(inds$col, names(df))
inds <- as.matrix(inds)
#check
chk <- m <- as.matrix(df[, names(df) != "tmp"])
mode(chk) <- "logical"
chk[] <- NA
chk[inds] <- m[inds] <= 3
sel <- apply(chk, 1, prod, na.rm = TRUE)
df[as.logical(sel),]
# A B C tmp
#1 3 4 6 A
#3 5 3 2 C

Not sure if it always works (and it probably isn't the best solution), but it worked here:
library(dplyr)
library(tidyr)
library(stringr)
List <- vector("list", nrow(df))
for (i in seq_len(nrow(df))) {   # loop over rows, not columns
  tmpT <- as.vector(str_split(df$tmp[i], ",", simplify = TRUE))
  selec <- df %>%
    select(all_of(tmpT)) %>%
    slice(i) %>%
    filter_all(all_vars(. <= 3)) %>%
    unite(val, sep = ", ")
  if (nrow(selec) == 0) {
    tab <- df[0, ]   # zero-row data frame, dropped by rbind
  } else {
    tab <- df[i, ]
  }
  List[[i]] <- tab
}
df2 <- do.call("rbind", List)

This answer has some similarities with @Roland's, but here we work with the data in a 'longer' format:
# create row index
df$ri = seq_len(nrow(df))
# split the concatenated column
l <- strsplit(df$tmp, ',')
# repeat each row of the data with the lengths of the split string,
# bind with individual strings
d = cbind(df[rep(1:nrow(df), lengths(l)), ], x = unlist(l))
# use match to grab values from the corresponding columns
# (as.numeric because matrix-indexing a data frame goes through as.matrix, which returns character here)
d$val <- as.numeric(d[cbind(seq(nrow(d)), match(d$x, names(d)))])
# for each original row 'ri', check if all values are <= 3. use result to index data frame
d[as.logical(ave(d$val, d$ri, FUN = function(x) all(x <= 3))), ]
# A B C tmp ri x val
# 1 3 4 6 A 1 A 3
# 3 5 3 2 C 3 C 2
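For completeness, the same long-format idea can be sketched with current dplyr/tidyr as well (my addition, not one of the original answers; assumes tidyr >= 1.0 for unnest() and pivot_longer()):
library(dplyr)
library(tidyr)
# one row per (original row, column named in tmp), then apply the <= 3 rule per original row
keep <- df %>%
  mutate(ri = row_number(), x = strsplit(tmp, ",")) %>%
  unnest(x) %>%
  pivot_longer(c(A, B, C), names_to = "col", values_to = "val") %>%
  filter(col == x) %>%
  group_by(ri) %>%
  summarise(ok = all(val <= 3))
df[keep$ok, ]
This again keeps rows 1 and 3.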

Related

Turn combinations of columns into some kind of interpretable variable

I want to turn combinations of columns into some kind of interpretable variable. There are 3 levels of a factor repeated in three columns, for each id. For all the combinations between the variables I would like to get a list, and once I have the list, I want to know how many times each combination occurs. For example, when q1 and q2 are the same, it should return "A", and then A appears XX times. Anyone with suggestions? Thanks!!
id <- 1:10
set.seed(1)
q1 <- sample(1:3, 10, replace=TRUE)
set.seed(2)
q2 <- sample(1:3, 10, replace=TRUE)
set.seed(2)
q3 <- sample(1:3, 10, replace=TRUE)
df <- data.frame(id,q1,q2,q3)
df
id q1 q2 q3
1 1 1 1 1
2 2 2 3 3
3 3 2 2 2
4 4 3 1 1
5 5 1 3 3
6 6 3 3 3
7 7 3 1 1
8 8 2 3 3
9 9 2 2 2
10 10 1 2 2
if df$q1=="1" & df$q2=="1" print A
if df$q1=="1" & df$q2=="2" print B
if df$q1=="1" & df$q2=="3" print C
if df$q1=="2" & df$q2=="3" print D
if df$q1=="2" & df$q2=="2" print E
if df$q1=="3" & df$q2=="3" print F
if df$q2=="1" & df$q2=="1" print G
if df$q2=="1" & df$q2=="2" print H
response <- save(print A, print B, print C and so on....)
length(A)
length(B)
and so on...
I think this should do what you want, using base R. I hope I understood your desired output. I basically combined each pair of columns into its own variable (comb.var[, i]), then combined that with each column-name pair to create another variable, output$fct, which represents each q-pair x value-pair combination. Finally I relabeled that variable and counted the occurrences of each combination with summary().
Code:
# dimensions of df
n = nrow(df) #rows
p = ncol(df) #columns
# unique pairs of q columns
pairs.n = choose(p - 1, 2) # number of unique pairs
pairs = combn(1:(p - 1), 2) # matrix of those pairs
# data frame of NAs of proper size
comb.var <- matrix(NA, nrow = n, ncol = pairs.n)
for(combo in 1:ncol(pairs)){
  i = pairs[1, combo]
  j = pairs[2, combo]
  # get the right 2 columns from df
  qi = df[, i + 1]
  qj = df[, j + 1]
  # combine into 1 variable
  comb.var[, combo] <- paste(qi, qj, sep = "_")
}
# clean up the output: turn comb.var into a vector and add id columns
output = data.frame(id = rep(df$id, times = pairs.n),
                    qi = rep(pairs[1, ], each = n),
                    qj = rep(pairs[2, ], each = n),
                    val = as.vector(comb.var))
# combine variables again
output$fct = with(output, paste(qi, qj, val, sep = "."))
# count number of different outputs
uniq.n = length(unique(output$fct))
# re-label the factor
output$fct <- factor(output$fct, labels = LETTERS[1:uniq.n])
# count the group members
summary(output$fct)
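If the goal is only to label and count the q1/q2 value pairs, a shorter base R sketch (my addition, not part of the answer above) builds the combination labels directly with interaction() and counts them with table():
# one label per observed q1/q2 combination, relabelled A, B, C, ...
combo <- interaction(df$q1, df$q2, drop = TRUE)
levels(combo) <- LETTERS[seq_len(nlevels(combo))]
table(combo)
Each observed q1/q2 pair gets one letter, and table() reports how often it occurs.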

Conditionally fill new column of data frame on column contents specified in a separate list

I have a dataset (x) that contains an ID column. I need to add a new dataframe column ("Var"). Var rows need to be conditionally filled with values (0 or 1) if ID matches a list. Var rows that are not matched to a list can be left empty or as NA.
It is important for my analysis that the order of the rows is not disrupted in any way.
x <- data.frame("ID" = 1:10)
list0 <- c(1,8,9)
list1 <- c(2,4,5,7,10)
The desired output is
data.frame("ID"= 1:10, "Var" = c(0,1,"NA",1,1,"NA",1,0,0,1))
Something like:
library(tidyverse)
x %>% mutate(Var = case_when(
  ID %in% list0 ~ 0,
  ID %in% list1 ~ 1
))
or using ifelse
x$Var <- ifelse(x$ID %in% list0, 0, ifelse(x$ID %in% list1, 1, NA))
Which produces the desired output:
ID Var
1 1 0
2 2 1
3 3 NA
4 4 1
5 5 1
6 6 NA
7 7 1
8 8 0
9 9 0
10 10 1
An option in base R is to start from an NA column and fill it in by membership:
x$Var <- NA
x$Var[x$ID %in% list0] <- 0
x$Var[x$ID %in% list1] <- 1
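Another base R sketch (my addition, assuming the IDs can be matched as character keys) builds a named lookup vector and indexes it by ID; this scales easily to more than two lists:
lookup <- c(setNames(rep(0, length(list0)), list0),
            setNames(rep(1, length(list1)), list1))
x$Var <- unname(lookup[as.character(x$ID)])
IDs that appear in neither list come back as NA automatically.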

R, how to replace only the numeric values of a dataframe?

I am working on R 3.4.3 on Windows 10. I have a dataframe made of numeric values and characters.
I would like to replace only the numeric values but when I do that the characters also change and are replaced.
How can I edit my function to make it affect only the numeric values and not the characters?
Here is the piece of code of my function:
dataframeChange <- function(dFrame){
thresholdVal <- 20
dFrame[dFrame >= thresholdVal] <- -1
return(dFrame)
}
Here is a dataframe example:
example_df <- data.frame(
myNums = c (1:5),
myChars = c("A","B","C","D","E"),
stringsAsFactors = FALSE
)
Thanks for the help!
As Tim commented, you should be aware of the location of the numeric columns, which we can find using ind <- sapply(dFrame, is.numeric):
dataframeChange <- function(dFrame){
  thresholdVal <- 20
  ind <- sapply(dFrame, is.numeric)
  dFrame[, ind][dFrame[, ind] >= thresholdVal] <- -1
  return(dFrame)
}
Use mutate_if from dplyr:
library(dplyr)
example_df %>% mutate_if(is.numeric, funs(if_else(. >= thresh, repl, .)))
myNums myChars
1 10 A
2 -1 B
3 -1 C
4 5 D
5 -1 E
Explanation:
The mutate family of functions is for variable assignment or updating.
With mutate_if, the function (specified within funs()) is only applied to columns that satisfy the predicate given as the first argument (in this case, is.numeric()).
The updating function is a simple if_else clause based on the OP's rules.
Data:
thresh <- 20
repl <- -1.0
example_df <- data.frame(
myNums = c(10,20,30,5,70),
myChars = c("A","B","C","D","E"),
stringsAsFactors = FALSE
)
example_df
myNums myChars
1 10 A
2 20 B
3 30 C
4 5 D
5 70 E
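As a side note (my addition, assuming dplyr >= 1.0 is available): the same replacement is nowadays usually written with across() and where() instead of the superseded mutate_if()/funs():
example_df %>%
  mutate(across(where(is.numeric), ~ if_else(.x >= thresh, repl, .x)))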
Using data.table, we can avoid explicit loops and it is faster. Here I've set the threshold value as 2:
library(data.table)
# set to data table
setDT(example_df)
# get numeric columns
num_cols <- names(example_df)[sapply(example_df, is.numeric)]
# loop over all columns at once
example_df[,(num_cols) := lapply(.SD, function(x) ifelse(x>2,-1, x)), .SDcols=num_cols]
print(example_df)
myNums myChars
1: 1 A
2: 2 B
3: -1 C
4: -1 D
5: -1 E
Another data.table solution.
library(data.table)
dataframeChange_dt <- function(dFrame){
  setDT(dFrame)
  for(j in seq_along(dFrame)){
    set(dFrame, i = which(dFrame[[j]] < 20), j = j, value = -1)
  }
}
dataframeChange_dt(example_df)
example_df
# myNums myChars
# 1: -1 A
# 2: 20 B
# 3: 30 C
# 4: -1 D
# 5: 70 E
It does not explicitly restrict itself to numeric columns; however, I tested it on multiple datasets and it does not affect the non-numeric columns.
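If you want that restriction to be explicit rather than relying on how < compares against character columns, a defensive variant might look like this (a sketch, my addition; the name dataframeChange_dt2 is just illustrative):
library(data.table)
dataframeChange_dt2 <- function(dFrame){
  setDT(dFrame)
  # identify numeric columns and only touch those
  num_cols <- which(vapply(dFrame, is.numeric, logical(1)))
  for (j in num_cols) {
    set(dFrame, i = which(dFrame[[j]] < 20), j = j, value = -1)
  }
}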

bind columns with different number of rows

I want to create an iteration that takes a list (which is a column of another dataframe) and adds it to the current data frame as a column. But the lengths of the columns are not equal, so I want to generate NA for the unmatched rows.
seq_actions = as.data.frame(x = NA)
for(i in 1:20){
  temp_seq = another_df$c1[some conditions]
  seq_actions = cbind(temp_seq, seq_actions)
}
To simplify, let's say I have
df
1 3
3 4
2 2
Adding the list 5, 6 as a new column to df, I want:
df
1 3 5
3 4 6
2 2 NA
Another list to add is 7 7 7 8, so my df will be:
df
1 3 5 7
3 4 6 7
2 2 NA 7
NA NA NA 8
How can I do it?
Here's one way. The merge function by design will add NA values whenever you combine data frames and no match is found (e.g., if you have fewer values in one data frame than in the other).
If you assume that you're matching your data frames (what rows go together) based on the row number, just output the row number as a column in your data frames. Then merge on that column. Merge will automatically add the NA values you want and deal with the fact that the data frames have different numbers of rows.
#test data frame 1
a <- c(1, 3, 2)
b <- c(3, 4, 2)
dat <- as.data.frame(cbind(a, b))
#test data frame 2 (this one has fewer rows than the first data frame)
c <- c(5, 6)
dat.new <- as.data.frame(c)
#add column to each data frame with row number
dat$number <- row.names(dat)
dat.new$number <- row.names(dat.new)
#merge data frames
#"all = TRUE" will mean that NA values will be added whenever there is no match
finaldata <- merge(dat, dat.new, by = "number", all = TRUE)
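For reference (my annotation, not part of the original answer), the merged result here looks like this. One design caveat: row.names() returns character strings, so with more than nine rows the merge would sort "10" before "2"; converting the helper column with as.numeric() avoids that.
finaldata
#   number a b  c
# 1      1 1 3  5
# 2      2 3 4  6
# 3      3 2 2 NA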
If you know the maximum possible size of df, and the total number of columns you want to append, you can create df in advance with all NA values and fill a column in based on its length. This would leave everything after its length still NA.
e.g.
max_col_num <- 20
max_col_size <- 10 # this could be the number of rows in the largest dataframe you have
df <- as.data.frame(matrix(ncol = max_col_num, nrow = max_col_size))
for(i in 1:20){
  temp_seq = another_df$c1[some conditions]
  df[seq_along(temp_seq), i] <- temp_seq
}
This would only work if you knew the total possible number of rows and columns.
I think the best approach could be to write a custom function based on the nrow of the data frame and the length of the vector/list.
One such function can be written as:
# Function to add a vector as a column
addToDF <- function(df, v){
  nRow <- nrow(df)
  lngth <- length(v)
  if(nRow > lngth){
    length(v) <- nRow
  } else if(nRow < lngth){
    df[(nRow+1):lngth, ] <- NA
  }
  cbind(df, v)
}
Let's test the above function with the data.frame provided by the OP.
df <- data.frame(A= c(1,3,2), B = c(3, 4, 2))
v <- c(5,6)
w <-c(7,7,8,9)
addToDF(df, v)
# A B v
# 1 1 3 5
# 2 3 4 6
# 3 2 2 NA
addToDF(df, w)
# A B v
# 1 1 3 7
# 2 3 4 7
# 3 2 2 8
# 4 NA NA 9
Following MKR's response, if you want to add a specific name to the newly added column, you can try:
addToDF <- function(df, v, col_name){
  nRow <- nrow(df)
  lngth <- length(v)
  if(nRow > lngth){
    length(v) <- nRow
  } else if(nRow < lngth){
    df[(nRow+1):lngth, ] <- NA
  }
  df_new <- cbind(df, v)
  colnames(df_new)[ncol(df_new)] <- col_name
  return(df_new)
}
where col_name is the name of the added column.

Combine table with different elements

I have items in different lists and I want to count the items in each list and output the counts to a table. However, I ran into difficulty when the lists contain different items. To illustrate my problem:
item_1 <- c("A","A","B")
item_2 <- c("A","B","B","B","C")
item_3 <- c("C","A")
item_4 <- c("D","A", "A")
item_5 <- c("B","D")
list_1 <- list(item_1, item_2, item_3)
list_2 <- list(item_4, item_5)
table_1 <- table(unlist(list_1))
table_2 <- table(unlist(list_2))
> table_1
A B C
4 4 2
> table_2
A B D
2 1 2
What I get from cbind is :
> cbind(table_1, table_2)
table_1 table_2
A 4 2
B 4 1
C 2 2
which is clearly wrong. What I need is:
table_1 table_2
A 4 2
B 4 1
C 2 0
D 0 2
Thanks in advance
It would probably be better to use factors at the start if possible, something like:
L <- list(list_1 = list_1,
          list_2 = list_2)
RN <- unique(unlist(L))
do.call(cbind,
        lapply(L, function(x)
          table(factor(unlist(x), RN))))
# list_1 list_2
# A 4 2
# B 4 1
# C 2 0
# D 0 2
However, going with what you have, a function like the following might be useful. I've added comments to help explain what's happening in each step.
myFun <- function(..., fill = 0) {
  ## Get the names of the ...s. These will be our column names
  CN <- sapply(substitute(list(...))[-1], deparse)
  ## Put the ...s into a list
  Lst <- setNames(list(...), CN)
  ## Get the relevant row names
  RN <- unique(unlist(lapply(Lst, names), use.names = FALSE))
  ## Create an empty matrix. `fill` can be anything--it's set to 0
  M <- matrix(fill, length(RN), length(CN),
              dimnames = list(RN, CN))
  ## Use match to identify the correct row to fill in
  Row <- lapply(Lst, function(x) match(names(x), RN))
  ## Use matrix indexing to fill in the unlisted values of Lst
  M[cbind(unlist(Row),
          rep(seq_along(Lst), vapply(Row, length, 1L)))] <-
    unlist(Lst, use.names = FALSE)
  ## Return your matrix
  M
}
Applied to your two tables, the outcome is like this:
myFun(table_1, table_2)
# table_1 table_2
# A 4 2
# B 4 1
# C 2 0
# D 0 2
Here's an example with adding another table to the problem. It also demonstrates use of NA as a fill value.
set.seed(1) ## So you can get the same results as me
table_3 <- table(sample(LETTERS[3:6], 20, TRUE) )
table_3
#
# C D E F
# 2 7 9 2
myFun(table_1, table_2, table_3, fill = NA)
# table_1 table_2 table_3
# A 4 2 NA
# B 4 1 NA
# C 2 NA 2
# D NA 2 7
# E NA NA 9
# F NA NA 2
To fix your existing problem, you can put the two tables into a list and add the missing values and names back in. Here, nm is a vector of all the unique item names across the tables, tbs is a list of the tables, and we can use sapply to append and reorder the missing values.
> nm <- unique(unlist(mget(paste("item", 1:5, sep = "_"))))
> tbs <- list(t1 = table_1, t2 = table_2)
> sapply(tbs, function(x) {
x[4] <- 0L
names(x)[4] <- nm[!nm %in% names(x)]
x[nm]
})
t1 t2
A 4 2
B 4 1
C 2 0
D 0 2
A general solution, for when you have unknowns, and so that you can keep NA values, is
> sapply(tbs, function(x) {
length(x) <- length(nm)
x <- x[match(nm, names(x))]
setNames(x, nm)
})
t1 t2
A 4 2
B 4 1
C 2 NA
D NA 2
But you could have avoided this entirely by going straight from the items to the table. You put the items into a list and then unlisted them in the very next step. There is also a useNA argument in table that will keep the <NA> category even when its count is zero (it does not, however, add zero counts for item levels that are simply absent).
> t1 <- table(c(item_1, item_2, item_3), useNA = "always")
> t2 <- table(c(item_4, item_5), useNA = "always")
> table(c(item_4, item_5), useNA = "always")
A B D <NA>
2 1 2 0
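If what you actually need is a zero count for items that are absent from one of the tables (as in your desired output), tabulating over a common set of factor levels does that directly; a small sketch (my addition, the same idea as the factor-based answer above):
lv <- sort(unique(c(item_1, item_2, item_3, item_4, item_5)))
tab1 <- table(factor(c(item_1, item_2, item_3), levels = lv))
tab2 <- table(factor(c(item_4, item_5), levels = lv))
cbind(table_1 = tab1, table_2 = tab2)
#   table_1 table_2
# A       4       2
# B       4       1
# C       2       0
# D       0       2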
A quick fix to your problem is to make the tables into data frames and then merge them:
d1 <- data.frame(value=names(table_1), table_1=as.numeric(table_1))
d2 <- data.frame(value=names(table_2), table_2=as.numeric(table_2))
merge(d1,d2, all=TRUE)
This will create NA's where you might want 0's. That can be fixed with
M <- merge(d1,d2, all=TRUE)
M[is.na(M)] <- 0
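The result then matches the desired output (my annotation of what the code above produces):
M
#   value table_1 table_2
# 1     A       4       2
# 2     B       4       1
# 3     C       2       0
# 4     D       0       2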
