Remove Columns with Grep and Piping - r

Using the example data:
d1<-data.frame(years=c("1","5","10"),group.x=c(1:3),group.b=c(1:3),group.2x=c(1:3))
I removed the columns using the following:
d2<-d1[,-grep("\\.x$",colnames(d1))]
Is there a similar way to accomplish the same using piping instead?
d2<- d1 %>%
filter(!grep("\\.x$",colnames()))
returns: Error: argument "x" is missing, with no default
The goal is to remove the columns ending with ".x"

In dplyr, filter removes rows, while select keeps or drops columns, which is what you want here. You can stick with grep, but dplyr also offers specialized helper functions, such as ends_with in this case:
library(dplyr)
d1 %>%
select(-ends_with(".x"))
# years group.b group.2x
#1 1 1 1
#2 5 2 2
#3 10 3 3
Take a look at help("select") to find out more about other "special functions" you can use inside dplyr::select.
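As a hedged illustration of those helpers: matches() takes a regular expression, so the pattern from the question can be reused unchanged. This is a minimal sketch on the question's example data; the name d2 is only for illustration.

```r
library(dplyr)

# Example data from the question
d1 <- data.frame(years = c("1", "5", "10"),
                 group.x = 1:3, group.b = 1:3, group.2x = 1:3)

# matches() selects columns whose names match a regular expression;
# the leading minus drops those columns instead of keeping them
d2 <- d1 %>% select(-matches("\\.x$"))

names(d2)
```

Unlike ends_with(), which compares a literal suffix, matches() interprets its argument as a regex, so the dot has to be escaped.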

Simply keep d1 as the argument to colnames() inside select, as follows:
d1 %>%
select(-grep("\\.x$",colnames(d1)))
Results in:
years group.b group.2x
1 1 1
5 2 2
10 3 3
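One caveat with negative grep() indices, shown here as a base R sketch rather than as part of the answers above: if the pattern matches no column, -grep() evaluates to -integer(0) and selects zero columns. A logical index built with grepl() avoids this. The pattern "\\.zzz$" and the names keep, d3 and d4 are made up for the illustration.

```r
# Example data from the question
d1 <- data.frame(years = c("1", "5", "10"),
                 group.x = 1:3, group.b = 1:3, group.2x = 1:3)

# grepl() returns one TRUE/FALSE per column, so negating it is safe
# even when no column name matches the pattern
keep <- !grepl("\\.x$", colnames(d1))
d2 <- d1[, keep, drop = FALSE]

# A pattern that matches nothing: the logical index keeps every column
d3 <- d1[, !grepl("\\.zzz$", colnames(d1)), drop = FALSE]

# The same pattern with -grep() gives -integer(0), which selects no columns
d4 <- d1[, -grep("\\.zzz$", colnames(d1)), drop = FALSE]

ncol(d3)
ncol(d4)
```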

Related

Combine rows of data frame in R using colMeans?

I'm impressed by the number of "how to combine rows/columns" threads, but even more by the fact that none of these was particularly helpful or at least not applicable to my issue.
My data look like this:
MyData<-data.frame("id" = c("a","a","b"),
"value1_1990" = c(5,NA,1),
"value2_1990" = c(5,NA,2),
"value1_2000" = c(2,1,1),
"value2_2000" = c(2,1,2),
"value1_2010" = c(NA,9,1),
"value2_2010" = c(NA,9,2))
What I want to do is to combine the two rows where id=="a" for columns MyData[,(2:7)] using base R's colMeans.
What it looks like:
id value1_1990 value2_1990 value1_2000 value2_2000 value1_2010 value2_2010
1 a 5 5 2 2 NA NA
2 a NA NA 1 1 9 9
3 b 1 2 1 2 1 2
What I need:
id value1_1990 value2_1990 value1_2000 value2_2000 value1_2010 value2_2010
1 a 5 5 1.5 1.5 9 9
2 b 1 2 1 2 1 2
What I tried (among numerous other things):
MyData[nrow(MyData)+1, 2:7] = colMeans(MyData[which(MyData$id=="a"),(2:7)],na.rm=T) # to combine values from rows where id=="a"
MyData$id<-ifelse(is.na(MyData$id),"NewRow",MyData$id) # to replace "<NA>" in the id-column of the newly created row by "NewRow".
This works, except for the fact that...
...it turns all other existing id's into numeric values (and I don't want to let the second line of code -- the ifelse-statement -- touch any of the existing id's, which is why I wrote else==MyData$id).
...this is not particularly fancy code. Is there a one-line solution that does the trick? I saw other approaches using aggregate(), but they didn't work for me.
You can try using dplyr:
library(dplyr)
Possible solution:
# funs() is deprecated in newer dplyr; passing mean and na.rm directly avoids it
MyData %>% group_by(id) %>% summarise_all(mean, na.rm = TRUE)
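For comparison, here is a base R sketch of the same grouping with aggregate(). It uses the non-formula interface on purpose: the formula interface drops rows containing NA by default (na.action = na.omit), which may be why aggregate() appeared not to work. The name res is only for illustration.

```r
# Example data from the question
MyData <- data.frame(id = c("a", "a", "b"),
                     value1_1990 = c(5, NA, 1),
                     value2_1990 = c(5, NA, 2),
                     value1_2000 = c(2, 1, 1),
                     value2_2000 = c(2, 1, 2),
                     value1_2010 = c(NA, 9, 1),
                     value2_2010 = c(NA, 9, 2))

# Average every value column within each id; na.rm = TRUE is passed on to mean()
res <- aggregate(MyData[-1], by = list(id = MyData$id),
                 FUN = mean, na.rm = TRUE)

res
```

Rows with the same id collapse into one row, and each NA is ignored as long as the group has at least one non-missing value in that column.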

How to remove multiple columns from a dataframe with column list [duplicate]

I have a variable list with column names and a dataframe. I would like to remove columns from the dataframe when the column names match the variable list.
columns -> "a","c"
dataframe->
a b c d
0 0 1 1
1 1 1 1
Output->
b d
0 1
1 1
Please help me out with the solution.
select_ is deprecated as of dplyr 0.7. See the select_ docs for more info.
I believe the new recommended approach is to use one of the select helpers.
Using shadow's example, it would be:
select(dataframe, -one_of(c("a", "c")))
Update: Anders Swanson pointed out that you can now pass a character vector to select directly. So the following works:
select(dataframe, -columns)
# in recent dplyr/tidyselect versions, wrapping the vector is preferred: select(dataframe, -all_of(columns))
Previous version
You can use select_ together with '-' as shown below:
# create data
columns <- c("a","c")
dataframe <- read.table(text="a b c d
0 0 1 1
1 1 1 1 ", header = TRUE)
# load dplyr package
require(dplyr)
# select columns
select_(dataframe, .dots = paste0("-", columns))
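If dplyr is not available, the same result can be had in base R. The following is a minimal sketch using the question's data; it is an alternative to the answer above, not part of it, and the name result is only for illustration.

```r
# Example data from the question
columns <- c("a", "c")
dataframe <- data.frame(a = c(0, 1), b = c(0, 1), c = c(1, 1), d = c(1, 1))

# Keep every column whose name is not in 'columns';
# drop = FALSE keeps the result a data frame even if one column remains
result <- dataframe[, setdiff(names(dataframe), columns), drop = FALSE]

names(result)
```

Because setdiff() works on names, entries in columns that do not exist in the data frame are silently ignored.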

Flatten list column in data frame with ID column

My data frame contains the output of a survey with a select multiple question type. Some cells have multiple values.
df <- data.frame(a=1:3,b=I(list(1,1:2,1:3)))
df
a b
1 1 1
2 2 1, 2
3 3 1, 2, 3
I would like to flatten out the list to obtain the following output:
df
a b
1 1 1
2 2 1
3 2 2
4 3 1
5 3 2
6 3 3
This should be easy, but somehow I can't find the right search terms. Thanks.
You can just use unnest from "tidyr":
library(tidyr)
unnest(df, b)
# a b
# 1 1 1
# 2 2 1
# 3 2 2
# 4 3 1
# 5 3 2
# 6 3 3
Using base R, one option is stack after naming the list elements of 'b' column with that of the elements of 'a'. We can use setNames to change the names.
stack(setNames(df$b, df$a))
Or another option would be to use unstack to automatically name the list element of 'b' with 'a' elements and then do the stack to get a data.frame output.
stack(unstack(df, b~a))
Or we can use a convenient function listCol_l from splitstackshape to convert the list to data.frame.
library(splitstackshape)
listCol_l(df, 'b')
Here's one way, with data.table:
require(data.table)
data.table(df)[,as.integer(unlist(b)),by=a]
If b is stored consistently, as.integer can be skipped. You can check with
unique(sapply(df$b,class))
# [1] "numeric" "integer"
Here's another base solution, far less elegant than any other solution posted thus far. Posting for the sake of completeness, though personally I would recommend akrun's base solution.
with(df, cbind(a = rep(a, sapply(b, length)), b = do.call(c, b)))
This constructs the first column as the elements of a, where each is repeated to match the length of the corresponding list item from b. The second column is b "flattened" using do.call() with c().
As Ananda Mahto pointed out in a comment, sapply(b, length) can be replaced with lengths(b), which is available as of R 3.2.0.
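Putting the rep() idea and lengths() together, a sketch that returns a data frame instead of the matrix produced by cbind() could look like this. It assumes R 3.2.0 or later for lengths(); the name flat is only for illustration.

```r
# Example data from the question
df <- data.frame(a = 1:3, b = I(list(1, 1:2, 1:3)))

# Repeat each 'a' once per element of the matching list entry in 'b',
# then flatten 'b' into a plain vector
flat <- data.frame(a = rep(df$a, lengths(df$b)),
                   b = unlist(df$b))

flat
```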
A base R approach might also be to create a new data.frame for each row and rbind it afterwards:
df <- data.frame(a=1:3,b=I(list(1,1:2,1:3)))
df
df <- lapply(seq_along(df$a), function(x){data.frame(a = df$a[[x]], b = df$b[[x]])})
df <- do.call("rbind", df)
df

Reshape data frame from wide to long with re-occurring column names in R

I'm trying to convert a data frame from wide to long format using the melt function. The challenge is that I have multiple columns with the same name. When I use melt, it drops the values from the repeated column. I have read similar questions where it was advised to use the reshape function, but I was not able to get it to work.
To reproduce my starting data frame:
conversion.id<-c("1", "2", "3")
interaction.num<-c("1","1","1")
interaction.num2<-c("2","2","2")
conversion.id<-as.data.frame(conversion.id)
interaction.num<-as.data.frame(interaction.num)
interaction.num2<-as.data.frame(interaction.num2)
conversion<-c(rep("1",3))
conversion<-as.data.frame(conversion)
df<-cbind(conversion.id,interaction.num, interaction.num2, conversion)
names(df)[3]<-"interaction.num"
The data frame looks like the following:
When I run the following melt function:
melt.df<-melt(df,id="conversion.id")
It drops the interaction.num == 2 column and looks something like this:
The data frame I want is the following:
I saw the following post, but I'm not too familiar with the reshape function and wasn't able to get it to work.
How to reshape a dataframe with "reoccurring" columns?
And to add a layer of complexity, I'm looking for a method that is efficient. I need to perform this on a data frame that has around 1M rows and many columns with the same name.
Any advice would be greatly appreciated!
Here is a solution using tidyr instead of reshape2. One of the advantages is the gather_ function, which takes character vectors as inputs. So, first we can replace all the "problematic" variable names with unique names (by adding numbers to the end of each name) and then we can gather (the equivalent of melt) these specific variables. The unique names of the variables are stored in a temporary variable called "prob_var_name", which I removed at the end.
library(tidyr)
library(dplyr)
library(magrittr) # provides equals(), which dplyr does not re-export
var_name <- "interaction.num"
problem_var <- df %>%
names %>%
equals(var_name) %>%
which
replaced_names <- mapply(paste0,names(df)[problem_var],seq_along(problem_var))
names(df)[problem_var] <- replaced_names
df %>%
gather_("prob_var_name",var_name,replaced_names) %>%
select(-prob_var_name)
conversion.id conversion interaction.num
1 1 1 1
2 2 1 1
3 3 1 1
4 1 1 2
5 2 1 2
6 3 1 2
Thanks to the quoting ability of gather_, you could wrap all this into a function and set var_name to a variable. Then maybe you could use it on all of your duplicated variables?
Here's a solution using data.table. You just have to provide the index instead of names.
require(data.table)
require(reshape2)
ans <- melt(setDT(df), measure=2:3,
value.name="interaction.num")[, variable := NULL]
# conversion.id conversion interaction.num
# 1: 1 1 1
# 2: 2 1 1
# 3: 3 1 1
# 4: 1 1 2
# 5: 2 1 2
# 6: 3 1 2
You can get the indices 2:3 by doing grep("interaction.num", names(df)).
Here's an approach in base R that should work for you:
x <- grep("interaction.num", names(df)) ## as suggested by Arun
## Make more friendly names for reshape
names(df)[x] <- paste(names(df)[x], seq_along(x), sep = "_")
## Reshape
reshape(df, direction = "long",
idvar=c("conversion.id", "conversion"),
varying = x, sep = "_")
# conversion.id conversion time interaction.num
# 1.1.1 1 1 1 1
# 2.1.1 2 1 1 1
# 3.1.1 3 1 1 1
# 1.1.2 1 1 2 2
# 2.1.2 2 1 2 2
# 3.1.2 3 1 2 2
Another possibility is stack instead of reshape:
x <- grep("interaction.num", names(df)) ## as suggested by Arun
cbind(df[-x], stack(lapply(df[x], as.character)))
The lapply(df[x], as.character) may not be necessary depending on if your values are actually numeric or not. The way you created this sample data, they were factors.
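The index-based idea behind these answers can also be written out by hand. The following base R sketch selects the duplicated columns by position, repeats the identifier columns once per duplicated column, and flattens the values. It rebuilds the question's data more compactly; ids and long are illustrative names, and the as.character() conversion is only there to cope with factor columns.

```r
# Compact reconstruction of the question's data, including the duplicated name
df <- data.frame(conversion.id = c("1", "2", "3"),
                 interaction.num = c("1", "1", "1"),
                 interaction.num2 = c("2", "2", "2"),
                 conversion = c("1", "1", "1"),
                 stringsAsFactors = FALSE)
names(df)[3] <- "interaction.num"

# Positions of the columns that share the name
x <- which(names(df) == "interaction.num")

# Identifier columns, repeated once per duplicated column
ids <- df[-x]
ids <- ids[rep(seq_len(nrow(ids)), times = length(x)), , drop = FALSE]

# Values of the duplicated columns, read by position and stacked end to end
values <- unlist(lapply(x, function(i) as.character(df[[i]])))

long <- data.frame(ids, interaction.num = values, stringsAsFactors = FALSE)
rownames(long) <- NULL

long
```

Selecting by position with df[[i]] sidesteps the ambiguity of name-based access when names repeat.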

Remove last N rows in data frame with the arbitrary number of rows

I have a data frame and I want to remove last N rows from it.
If I want to remove 5 rows, I currently use the following command, which in my opinion is rather convoluted:
df<- df[-seq(nrow(df),nrow(df)-4),]
How would you accomplish this task? Is there a convenient function that I can use in R?
In unix, I would use:
tac file | sed '1,5d' | tac
head with a negative index is convenient for this...
df <- data.frame( a = 1:10 )
head(df,-5)
# a
#1 1
#2 2
#3 3
#4 4
#5 5
p.s. your seq() example may be written slightly less(?) awkwardly using the named arguments by and length.out (shortened to len) like this -seq(nrow(df),by=-1,len=5).
This one takes one more line, but is far more readable:
n<-dim(df)[1]
df<-df[1:(n-5),]
Of course, you can do it in one line by sticking the dim command directly into the re-assignment statement.
I assume this is part of a reproducible script, and you can retrace your steps... Otherwise, strongly recommend in such cases to save to a different variable (e.g., df2) and then remove the redundant copy only after you're sure you got what you wanted.
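One edge case is worth a sketch: 1:(n-5) counts downward when the data frame has five rows or fewer, so the index no longer means "the first n-5 rows". Building the index with seq_len() avoids this. The helper name drop_last and the small example frames are hypothetical, not taken from the answers above.

```r
# drop_last is a hypothetical helper: remove the last k rows of a data frame
drop_last <- function(df, k) {
  # seq_len(0) is an empty index, so asking for too many rows yields zero rows
  keep <- seq_len(max(nrow(df) - k, 0))
  df[keep, , drop = FALSE]
}

df <- data.frame(a = 1:10)
trimmed <- drop_last(df, 5)

# With exactly five rows, 1:(n - 5) is c(1, 0) and still returns row 1
five <- data.frame(a = 1:5)
n <- nrow(five)
naive <- five[1:(n - 5), , drop = FALSE]
safe <- drop_last(five, 5)

nrow(naive)
nrow(safe)
```

head(df, -k) behaves like the seq_len() version here and also returns zero rows when k is at least nrow(df).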
Adding a dplyr answer for completeness:
library(dplyr)
# tibble() replaces the deprecated data_frame()
test_df <- tibble(a = c(1,2,3,4,5,6,7,8,9,10),
b = c("a","b","c","d","e","f","g","h","i","j"))
slice(test_df, 1:(n()-5))
## A tibble: 5 x 2
# a b
# <dbl> <chr>
#1 1 a
#2 2 b
#3 3 c
#4 4 d
#5 5 e
Another dplyr answer which is even more readable:
df %>% filter(row_number() <= n()-5)

Resources