I have a data frame (df) of
A AA B C D
23 1 1,0,0 0,1,0 0
10 0 0,0,0 1 1,1
I would like the following df2
A AA B C D
23 1 1 1 0
10 0 0 1 1
I really don't have any idea how to even begin coding this. But my shot in the dark is
df2 <- df1 %>% filter_at(vars(B, C, D), any_var(. !=0))
I get the following error: error in is.data.frame(x): object 'B' not found
Any help in this matter would be greatly appreciated, as I am new to all of this. Thanks.
I hope this is what you are looking for:
library(dplyr)
library(tidyr)
df %>%
separate_rows(D) %>%
separate_rows(AA, B, C) %>%
group_by(A) %>%
summarise(across(AA:D, ~ max(.x)))
# A tibble: 2 x 5
A AA B C D
<int> <int> <chr> <chr> <chr>
1 10 0 0 1 1
2 23 1 1 1 0
df$B<-as.character(df$B)
df$B[df$B == "1,0,0"] <- "1"
df$B[df$B == "0,0,0"] <- "0"
df$B<- as.factor(df$B)
df$C<-as.character(df$C)
df$C[df$C == "0,1,0"] <- "1"
df$C<- as.factor(df$C)
df$D<-as.character(df$D)
df$D[df$D == "1,1"] <- "1"
df$D<- as.factor(df$D)
Ok, I think I figured it out. This may not be the most efficient way to code for this, but it does get the job done and changes the values in df to reflect what I wanted for df2.
If I needed to keep the original df and create a new df2 then I would just put this code first.
df2<-df
Here is a way with base R. Explanation: strsplit splits at the comma and max finds the maximum value. sapply applies this to every element.
df1 <- read.table(text="
A AA B C D
23 1 1,0,0 0,1,0 0
10 0 0,0,0 1 1,1",
header = TRUE)
f <- function(x) {
if (is.character(x)) {
as.numeric(sapply(strsplit(x, ","), max))
} else {
x
}
}
df2 <- sapply(df1, f)
df2
If the type of non-numerical cells needs to be kept, just remove as.numeric(). To be precise: in the above, max works alphabetically at the character level, not with numbers.
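To make the comparison numeric rather than alphabetical, the pieces can be converted before taking the maximum. The following is only a minimal sketch, not the code from the answer above: the helper name max_numeric is made up for illustration, and stringsAsFactors = FALSE is assumed so that the comma-separated columns are read as character.

```r
df1 <- read.table(text = "
A AA B C D
23 1 1,0,0 0,1,0 0
10 0 0,0,0 1 1,1",
header = TRUE, stringsAsFactors = FALSE)

# hypothetical helper: numeric maximum of each comma-separated cell
max_numeric <- function(x) {
  if (is.character(x)) {
    # split at the comma, convert the pieces to numbers, then take the max
    vapply(strsplit(x, ","), function(p) max(as.numeric(p)), numeric(1))
  } else {
    x
  }
}

# lapply keeps one result per column; as.data.frame rebuilds a data frame
df2 <- as.data.frame(lapply(df1, max_numeric))
df2
```

With this variant a cell such as "9,10" yields 10, whereas max on the character pieces would return "9".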
As an alternative, here is a version with regular expressions. I am curious myself how to improve it:
f <- function(x) {
x <- gsub("0,", "", x)
x <- gsub(",0", "", x)
x <- gsub(",[1-9].*", "", x)
as.numeric(x)
}
df2 <- sapply(df1, f)
df2
I have a dataframe with one column and some rows which I want to transform into a vector. The name of the column should be the name of the vector as well. Usually, I create another object doing this:
new_object <- as.vector(df$variable_name)
new_object
Is there a way to keep the variable name as the name of the vector?
(I am asking as I try to build this in a function and need it therefore)
Thank you!
You can use list2env -
df <- data.frame(a = 1:5)
list2env(df, .GlobalEnv)
a
#[1] 1 2 3 4 5
sub <-c("A","A","A","A","B","B","B","B","C","C","C","C")
n<-c(0,1,1,1,0,1,0,1,0,1,0,1)
df <- data.frame(sub, n)
n <- df$n
[1] 0 1 1 1 0 1 0 1 0 1 0 1
You should be able to just simply call the variable name to get the vector.
df <- data.frame(x = c(1, 2, 3))
assign(names(df), df[,1])
x
# [1] 1 2 3
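Since the question mentions doing this inside a function, here is a rough sketch of wrapping the assign approach. The function name col_to_vector and its envir argument are hypothetical; the sketch assumes a data frame with exactly one column and creates the variable in the caller's environment by default.

```r
# hypothetical wrapper around assign(); not part of any package
col_to_vector <- function(df, envir = parent.frame()) {
  stopifnot(is.data.frame(df), ncol(df) == 1)
  # create a variable named after the column in the chosen environment
  assign(names(df), df[[1]], envir = envir)
  invisible(df[[1]])
}

df <- data.frame(x = c(1, 2, 3))
col_to_vector(df)
x
```

Creating variables as a side effect can overwrite existing objects of the same name, so returning the vector and assigning it explicitly may be the safer design.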
We could use attach
attach(df)
-output
> a
[1] 1 2 3 4 5
data
df <- data.frame(a = 1:5)
It's hard to explain, so I'll start with an example. I have some numeric columns (A, B, C). The column 'tmp' contains variable names of the numeric columns as concatenated strings:
set.seed(100)
A <- floor(runif(5, min=0, max=10))
B <- floor(runif(5, min=0, max=10))
C <- floor(runif(5, min=0, max=10))
tmp <- c('A','B,C','C','A,B','A,B,C')
df <- data.frame(A,B,C,tmp)
A B C tmp
1 3 4 6 A
2 2 8 8 B,C
3 5 3 2 C
4 0 5 3 A,B
5 4 1 7 A,B,C
Now, for each row, I want to use the variable names in tmp to select the values from the corresponding numeric columns with the same name(s). Then I want to keep only the rows where all the selected values are less than or equal to 3.
E.g. in the first row, tmp is A, and the corresponding value in column A is 3, i.e. keep this row.
Another example: in row 4, tmp is A,B. The corresponding values are A = 0 and B = 5. Thus, not all selected values are less than or equal to 3, and this row is discarded.
Desired result:
A B C tmp
1 3 4 6 A
2 5 3 2 C
How can I perform such filtering?
This is a bit more complicated than I like and there might be a more elegant solution, but here we go:
#split tmp
col <- strsplit(df[["tmp"]], ",")
#create an index matrix
inds <- do.call(rbind, Map(data.frame, row = seq_along(col), col = col))
inds$col <- match(inds$col, names(df))
inds <- as.matrix(inds)
#check
chk <- m <- as.matrix(df[, names(df) != "tmp"])
mode(chk) <- "logical"
chk[] <- NA
chk[inds] <- m[inds] <= 3
sel <- apply(chk, 1, prod, na.rm = TRUE)
df[as.logical(sel),]
# A B C tmp
#1 3 4 6 A
#3 5 3 2 C
Not sure if it works always (and probably isn't the best solution)... but it worked here:
library(dplyr)
library(tidyr)
library(stringr)
List= vector("list")
for (i in seq_len(nrow(df))){
tmpT= as.vector(str_split(df$tmp[i], ",", simplify=TRUE))
selec= df %>%
select(tmpT) %>%
slice(which(row_number() == i)) %>%
filter_all(., all_vars(. <= 3)) %>%
unite(val, sep= ", ")
if (nrow(selec) == 0) {
tab= NA
} else{
tab= df[i,]
}
List[[i]] = tab
}
df2= do.call("rbind", List)
This answer has some similarities with @Roland's, but here we work with the data in a 'longer' format:
# create row index
df$ri = seq_len(nrow(df))
# split the concatenated column
l <- strsplit(df$tmp, ',')
# repeat each row of the data with the lengths of the split string,
# bind with individual strings
d = cbind(df[rep(1:nrow(df), lengths(l)), ], x = unlist(l))
# use match to grab values from corresponding columns
d$val <- d[cbind(seq(nrow(d)), match(d$x, names(d)))]
# for each original row 'ri', check if all values are <= 3. use result to index data frame
d[as.logical(ave(d$val, d$ri, FUN = function(x) all(x <= 3))), ]
# A B C tmp ri x val
# 1 3 4 6 A 1 A 3
# 3 5 3 2 C 3 C 2
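For comparison, the same filter can be written as a plain row-by-row check in base R. This is only a sketch: the values are typed in from the table shown in the question rather than regenerated with set.seed, and it is likely slower than the matrix-indexing approaches on large data.

```r
# values copied from the printed example in the question
A <- c(3, 2, 5, 0, 4)
B <- c(4, 8, 3, 5, 1)
C <- c(6, 8, 2, 3, 7)
tmp <- c("A", "B,C", "C", "A,B", "A,B,C")
df <- data.frame(A, B, C, tmp, stringsAsFactors = FALSE)

# for each row, look up the columns named in tmp and test them against 3
keep <- vapply(seq_len(nrow(df)), function(i) {
  cols <- strsplit(df$tmp[i], ",")[[1]]
  all(unlist(df[i, cols]) <= 3)
}, logical(1))

df[keep, ]
```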
I want to remove all the rows whose numeric values are all either zeros or NAs. In the code below I am selecting numeric variables and then filtering out 0s. The problem is that it does not return the character variables along with the numeric ones in the final output.
df <- read.table(header = TRUE, text =
"x y z
a 1 2
b 0 3
c 1 NA
d 0 NA
")
df %>% select_if(is.numeric) %>% filter(rowSums(., na.rm = T)!=0)
You can use filter_if :
library(dplyr)
df %>% filter_if(is.numeric, any_vars(. != 0 & !is.na(.)))
# x y z
#1 a 1 2
#2 b 0 3
#3 c 1 NA
Or using base R :
cols <- sapply(df, is.numeric)
df[rowSums(!is.na(df[cols]) & df[cols] != 0) > 0, ]
Another dplyr option could be:
df %>%
rowwise() %>%
filter(any(across(where(is.numeric)) != 0, na.rm = TRUE))
x y z
<fct> <int> <int>
1 a 1 2
2 b 0 3
3 c 1 NA
Following the suggestions written in this new doc page after the release of dplyr version 1.0.0, you can create a helper function to substitute the superseded functions filter_if and any_vars.
Previously, filter() was paired with the all_vars() and any_vars()
helpers. Now, across() is equivalent to all_vars(), and there’s no
direct replacement for any_vars(). However you can make a simple
helper yourself
From now on, this should be the reference method for this kind of filtering step.
rowAny <- function(x) {rowSums(x != 0 & !is.na(x)) > 0}
df %>% filter(rowAny(across(where(is.numeric))))
# x y z
# 1 a 1 2
# 2 b 0 3
# 3 c 1 NA
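Later dplyr releases added if_any() and if_all(), which cover the same case without a hand-written helper. The sketch below assumes dplyr 1.0.4 or later is installed; it is an alternative to the rowAny helper, not a replacement prescribed by the quoted documentation.

```r
library(dplyr)

df <- read.table(header = TRUE, text = "
x y z
a 1 2
b 0 3
c 1 NA
d 0 NA
")

# keep rows where at least one numeric column is neither 0 nor NA
# (if_any() is assumed to be available, i.e. dplyr >= 1.0.4)
res <- df %>%
  filter(if_any(where(is.numeric), ~ !is.na(.x) & .x != 0))
res
```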
You could simply do
df[rowSums(suppressWarnings(sapply(df, as.double)), na.rm=TRUE) > 0, ]
# x y z
# 1 a 1 2
# 2 b 0 3
# 3 c 1 NA
I am working on R 3.4.3 on Windows 10. I have a dataframe made of numeric values and characters.
I would like to replace only the numeric values but when I do that the characters also change and are replaced.
How can I edit my function to make it affect only the numeric values and not the characters?
Here is the piece of code of my function:
dataframeChange <- function(dFrame){
thresholdVal <- 20
dFrame[dFrame >= thresholdVal] <- -1
return(dFrame)
}
Here is a dataframe example:
example_df <- data.frame(
myNums = c (1:5),
myChars = c("A","B","C","D","E"),
stringsAsFactors = FALSE
)
Thanks for the help!
As Tim's comment points out, you need to know where the numeric columns are. We can locate them with ind <- sapply(dFrame, is.numeric) and restrict the replacement to those columns:
dataframeChange <- function(dFrame){
  thresholdVal <- 20
  # logical index of the numeric columns
  ind <- sapply(dFrame, is.numeric)
  # replace values only within the numeric columns
  dFrame[ind][dFrame[ind] >= thresholdVal] <- -1
  return(dFrame)
}
Use mutate_if from dplyr:
library(dplyr)
example_df %>% mutate_if(is.numeric, funs(if_else(. >= thresh, repl, .)))
myNums myChars
1 10 A
2 -1 B
3 -1 C
4 5 D
5 -1 E
Explanation:
The mutate family of functions is for variable assignment or updating.
mutate_if functions (specified within funs()) are only applied to columns which satisfy the first argument (in this case, is.numeric())
The updating function is a simple if_else clause based on OP rules.
Data:
thresh <- 20
repl <- -1.0
example_df <- data.frame(
myNums = c(10,20,30,5,70),
myChars = c("A","B","C","D","E"),
stringsAsFactors = FALSE
)
example_df
myNums myChars
1 10 A
2 20 B
3 30 C
4 5 D
5 70 E
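In current dplyr the scoped verbs and funs() are superseded or deprecated, and the same update can be sketched with across(). This assumes dplyr 1.0.0 or later; thresh, repl, and example_df are the same values as in the data above.

```r
library(dplyr)

thresh <- 20
repl <- -1.0
example_df <- data.frame(
  myNums = c(10, 20, 30, 5, 70),
  myChars = c("A", "B", "C", "D", "E"),
  stringsAsFactors = FALSE
)

# apply the replacement to numeric columns only; other columns pass through
res <- example_df %>%
  mutate(across(where(is.numeric), ~ if_else(.x >= thresh, repl, .x)))
res
```

Note that if_else() is strict about types, so repl has to be a double here because myNums is a double column.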
Using data.table, we can avoid explicit loops, and it is faster. Here I've set the threshold value to 2:
library(data.table)
# convert to data.table by reference
setDT(example_df)
# get numeric columns
num_cols <- names(example_df)[sapply(example_df, is.numeric)]
# loop over all columns at once
example_df[,(num_cols) := lapply(.SD, function(x) ifelse(x>2,-1, x)), .SDcols=num_cols]
print(example_df)
myNums myChars
1: 1 A
2: 2 B
3: -1 C
4: -1 D
5: -1 E
Another data.table solution.
library(data.table)
dataframeChange_dt <- function(dFrame){
setDT(dFrame)
for(j in seq_along(dFrame)){
set(dFrame, i= which(dFrame[[j]] < 20), j = j, value = -1)
}
}
dataframeChange_dt(example_df)
example_df
# myNums myChars
# 1: -1 A
# 2: 20 B
# 3: 30 C
# 4: -1 D
# 5: 70 E
It does not explicitly select only the numeric columns; however, I tested it on multiple datasets and it does not affect the non-numeric columns.
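To avoid relying on how character columns compare against a number, the loop can be limited to numeric columns. The following is a sketch only: the function name replace_below_dt and its arguments are hypothetical, it keeps the `<` comparison used in the answer above, and it works on a copy made with as.data.table instead of modifying the input by reference.

```r
library(data.table)

# hypothetical variant that touches numeric columns only
replace_below_dt <- function(dFrame, threshold = 20, value = -1) {
  dt <- as.data.table(dFrame)  # copy, so the caller's object is unchanged
  num_cols <- which(vapply(dt, is.numeric, logical(1)))
  for (j in num_cols) {
    # set() updates the selected rows of column j in place
    set(dt, i = which(dt[[j]] < threshold), j = j, value = value)
  }
  dt
}

example_df <- data.frame(
  myNums = c(10, 20, 30, 5, 70),
  myChars = c("A", "B", "C", "D", "E"),
  stringsAsFactors = FALSE
)
out <- replace_below_dt(example_df)
out
```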
I have several dataframes in a list, which I want to merge into one big dataframe. The actual list contains several thousand of these dataframes, and I am therefore looking for a preferably efficient solution.
The list looks similar to this:
v <- data.frame(answer = c(1,1,1))
rownames(v) <- c("A","B","C")
w <- data.frame(answer = c(1,0,0))
rownames(w) <- c("A","B","D")
x <- data.frame(answer = c(1,1,1))
rownames(x) <- c("A","B","C")
y <- data.frame(answer = c(0,0,0))
rownames(y) <- c("A","C","D")
z <- data.frame(answer = c(0,0,0,1))
rownames(z) <- c("A","B","C","D")
l <- list(v,w,x,y,z)
names(l) <- c("V","W","X","Y","Z")
The final output should look like this:
V W X Y Z
A 1 1 1 0 0
B 1 0 1 NA 0
C 1 NA 1 0 0
D NA 0 NA 0 1
What i have tried already (feel free to ignore this part, if you already have a working solution)
df <- data.frame(matrix(unlist(l), nrow=length(l), byrow=T),stringsAsFactors=FALSE)
and
df <- do.call(rbind.data.frame, l)
and
df<- rbindlist(l) (from library("data.table"))
Those all lose the information contained in the rownames and only seem to work if all dataframes have the same length and the same order.
The only one that kinda worked with my actual data was something along the lines of:
df<- suppressWarnings(Reduce(function(dtf1, dtf2) merge(dtf1, dtf2, by =
"answer", all = TRUE),l))
but I am not able to make it work with my example list, and even when it worked it was extremely inefficient and took ages once the list got longer.
Here is a base R solution using merge and Reduce:
df <- Reduce(
function(x, y) merge(x, y, by = "id", all = T),
lapply(l, function(x) { x$id <- rownames(x); x }))
colnames(df) <- c("id", names(l))
# id V W X Y Z
#1 A 1 1 1 0 0
#2 B 1 0 1 NA 0
#3 C 1 NA 1 0 0
#4 D NA 0 NA 0 1
We create a row names column and then do the join: loop through the list with map, create a row names column with rownames_to_column, reduce to a single dataset with a full_join by the row names, and rename the columns if needed.
library(tidyverse)
l %>%
map( ~ .x %>%
rownames_to_column('rn')) %>%
reduce(full_join, by = 'rn') %>%
rename_at(2:6, ~ names(l))
# rn V W X Y Z
#1 A 1 1 1 0 0
#2 B 1 0 1 NA 0
#3 C 1 NA 1 0 0
#4 D NA 0 NA 0 1
Or another option is to bind_rows and then spread
l %>%
map(rownames_to_column, 'rn') %>%
bind_rows(.id = 'grp') %>%
spread(grp, answer)
# rn V W X Y Z
#1 A 1 1 1 0 0
#2 B 1 0 1 NA 0
#3 C 1 NA 1 0 0
#4 D NA 0 NA 0 1
One way of doing this using something similar to what kind of already worked for you is to first declare the rownames as a variable, then rename the columns of your data frames to match their names in the list, and then merge.
df_l <- l %>% Map(setNames, ., names(.)) %>%
map(~mutate(., r=rownames(.))) %>%
Reduce(function(dtf1,dtf2) full_join(dtf1,dtf2,by="r"), .)
rownames(df_l) <- df_l$r
df_l$r <- NULL
To be honest, I'm not sure it is efficient though, and like you said it will probably take long as the list grows.
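If repeated joins turn out to be the bottleneck, one option is to skip merging entirely and fill a preallocated matrix by row name. This is an untested-at-scale sketch in base R, assuming every data frame has a single numeric column called answer and unique row names; it has not been benchmarked against the solutions above.

```r
v <- data.frame(answer = c(1, 1, 1)); rownames(v) <- c("A", "B", "C")
w <- data.frame(answer = c(1, 0, 0)); rownames(w) <- c("A", "B", "D")
x <- data.frame(answer = c(1, 1, 1)); rownames(x) <- c("A", "B", "C")
y <- data.frame(answer = c(0, 0, 0)); rownames(y) <- c("A", "C", "D")
z <- data.frame(answer = c(0, 0, 0, 1)); rownames(z) <- c("A", "B", "C", "D")
l <- list(V = v, W = w, X = x, Y = y, Z = z)

# all row names that occur anywhere in the list
all_rows <- sort(unique(unlist(lapply(l, rownames))))

# one row per name, one column per list element, NA where nothing is known
m <- matrix(NA_real_, nrow = length(all_rows), ncol = length(l),
            dimnames = list(all_rows, names(l)))

# fill each column by matching on row names
for (nm in names(l)) {
  m[rownames(l[[nm]]), nm] <- l[[nm]]$answer
}

df <- as.data.frame(m)
df
```

Each data frame is touched once, so the work grows with the total number of rows rather than with the size of an ever-growing intermediate result.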