Exclude variables based on pattern and melt - r

My data look like this
id var1 var1_a var2 var2_a var3 var3_a
1 1 7 7 8 9 4
2 2 4 8 7 6 5
3 5 5 1 2 3 4
4 6 9 5 6 7 8
I want to select var1, var2, and var3 only, and exclude var1_a, var2_a and var3_a. Variable names may vary in length.
I know I can use something like
dt.m <- melt(dt, id = 1, measure.vars = c(2, 4, 6), na.rm = TRUE)
but I don't want to use this approach because I have too many variables.
How can I do this using patterns or a similar approach?

If the measure column names have a pattern to them then use grep to find which they are. In the example, the variables of interest all end in a digit so we could use this:
melt(dt, id = 1, measure = grep("\\d$", names(dt)), na.rm = TRUE)
or, if the columns of interest are in predictable positions, use seq or a similar approach to generate the column numbers:
melt(dt, id = 1, measure = seq(2, 6, 2), na.rm = TRUE)
Other ways to pick out the names that work in the example are:
# pick out column names that have 4 characters
which(nchar(names(dt)) == 4)
# pick out names having no underscore and that are not first
grep("_", names(dt), invert = TRUE)[-1]
# pick out even positions
which( (1:ncol(dt)) %% 2 == 0)
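To check that these selectors agree, here is a minimal sketch that rebuilds the example data as a plain data frame (typed in by hand from the question) and compares the positions each approach returns. Any of the resulting index vectors could then be passed as the measure argument of melt.

```r
# Example data from the question, entered by hand
dt <- data.frame(
  id     = 1:4,
  var1   = c(1, 2, 5, 6),
  var1_a = c(7, 4, 5, 9),
  var2   = c(7, 8, 1, 5),
  var2_a = c(8, 7, 2, 6),
  var3   = c(9, 6, 3, 7),
  var3_a = c(4, 5, 4, 8)
)

# Four ways of locating var1, var2 and var3
by_digit      <- grep("\\d$", names(dt))                 # names ending in a digit
by_position   <- seq(2, 6, 2)                            # every second column from 2
by_length     <- which(nchar(names(dt)) == 4)            # names with 4 characters
by_underscore <- grep("_", names(dt), invert = TRUE)[-1] # no underscore, drop id

names(dt)[by_digit]
# [1] "var1" "var2" "var3"
```

Only the first and last selectors depend on the naming pattern rather than on column order or name length, so they are probably the safer choice when names vary in length.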

Sorry, I'd comment but I don't have enough rep yet. If your variables are actually named var1, var1_a, etc., you can use gsub: a name that is unchanged after removing "_a" is one of the columns to keep.
names1 = paste0("var",seq(1,100))
names2 = paste0("var",seq(1,100),"_a")
names = sample(c(names1, names2))
x = matrix(rnorm(200*10),nrow=10)
d = data.frame(x)
names(d) = names
d.m <- d[,which(gsub("_a","",names(d)) == names(d))]
print(names(d.m))
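A shorter variant of the same idea, as a sketch that assumes the unwanted columns always end in "_a": test the names with an anchored grepl instead of comparing against the gsub result. The small data frame here is made up for illustration.

```r
# Hypothetical small example with the same naming scheme
d <- data.frame(var1 = 1:3, var1_a = 4:6, var2 = 7:9, var2_a = 10:12)

# TRUE for names that do NOT end in "_a"
keep <- !grepl("_a$", names(d))

# drop = FALSE keeps a data frame even if only one column survives
d.m <- d[, keep, drop = FALSE]
names(d.m)
# [1] "var1" "var2"
```

Anchoring with $ means a name such as var_alpha would be kept, whereas the unanchored gsub("_a", "", ...) comparison would drop it.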

Related

Selecting columns with fixed pattern at beginning and end of name, and variable middle part

I want to select data frame columns that have a certain pattern at the beginning and the end of their names, and one out of several possible values in the middle. This is what works, but I find the double use of intersect not very elegant.
df <- data.frame(var1_one_num = sample(1:10, 10, replace = TRUE),
var1_two_num = sample(1:10, 10, replace = TRUE),
var1_three_num = sample(1:10, 10, replace = TRUE),
var1_four_num = sample(1:10, 10, replace = TRUE),
var2_one_num = sample(1:10, 10, replace = TRUE),
var1_one_fac = sample(1:10, 10, replace = TRUE))
var_middle <- c("one|two|three")
df %>% select(intersect(starts_with("var1_"),
intersect(matches(var_middle),
ends_with("_num")))) %>% names()
[1] "var1_one_num" "var1_two_num" "var1_three_num"
I suspect there is a smarter way with any_of or similar, but I could not work it out.
Looks like you only need the column names - you can use regular expressions to achieve this:
> grep(pattern = '^var1.*(one|two|three).*num$', x = colnames(df), value = T)
[1] "var1_one_num" "var1_two_num" "var1_three_num"
The ^ anchor means the name must begin with the pattern that follows it, and the $ anchor means the name must end with the pattern that precedes it. The parenthesised group with | separators means that any one of the listed alternatives is acceptable.
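To see which part of the pattern rejects each name, here is a small sketch that applies the same regular expression to the six column names from the question, without needing the data frame itself.

```r
# Column names from the question
cols <- c("var1_one_num", "var1_two_num", "var1_three_num",
          "var1_four_num", "var2_one_num", "var1_one_fac")

pattern <- "^var1.*(one|two|three).*num$"

# Logical vector: does each name match the whole pattern?
hits <- grepl(pattern, cols)

cols[hits]
# [1] "var1_one_num"   "var1_two_num"   "var1_three_num"

cols[!hits]
# [1] "var1_four_num" "var2_one_num"  "var1_one_fac"
```

The three rejected names fail for different reasons: var1_four_num has none of the allowed middle values, var2_one_num does not start with var1, and var1_one_fac does not end in num.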
To get column values:
> df[, grep(pattern = '^var1.*(one|two|three).*num$', x = colnames(df), value = T)]
var1_one_num var1_two_num var1_three_num
1 9 1 7
2 2 10 4
3 2 9 1
4 1 5 4
5 4 9 10
6 6 8 8
7 9 5 7
8 6 2 6
9 5 3 5
10 1 1 7
If you're unfamiliar with regex, here's a good link to learn more: https://cran.r-project.org/web/packages/stringr/vignettes/regular-expressions.html
Hope this is helpful!
This is the answer by @tmfmnk, which I suggested he post, but he hasn't so far. Since I wanted something in dplyr, this is what I was looking for:
df %>% select(matches("^var1_(one|two|three)_.*num$"))

Finding unique tuples in R but ignoring order

Since my data is much more complicated, I made a smaller sample dataset (I left the reshape in to show how I generated the data).
set.seed(7)
x = rep(seq(2010,2014,1), each=4)
y = rep(seq(1,4,1), 5)
z = matrix(replicate(5, sample(c("A", "B", "C", "D"))))
temp_df = cbind.data.frame(x,y,z)
colnames(temp_df) = c("Year", "Rank", "ID")
head(temp_df)
require(reshape2)
dcast(temp_df, Year ~ Rank)
which results in...
> dcast(temp_df, Year ~ Rank)
Using ID as value column: use value.var to override.
Year 1 2 3 4
1 2010 D B A C
2 2011 A C D B
3 2012 A B D C
4 2013 D A C B
5 2014 C A B D
Now I essentially want to use a function like unique, but ignoring order to find where the first 3 elements are unique.
Thus in this case:
I would have A,B,C in row 5
I would have A,B,D in rows 1&3
I would have A,C,D in rows 2&4
Also I need counts of these "unique" events
Also 2 more things. First, my values are strings, and I need to leave them as strings.
Second, if possible, I would have a column between Year and 1 called Weighting, and when counting these unique combinations I would add up each row's weighting. This isn't as important because all weightings will be small positive integers, so I can potentially duplicate the rows beforehand to account for weighting, and then tabulate the unique combinations.
You could do something like this:
df <- dcast(temp_df, Year ~ Rank)
combos <- apply(df[, 2:4], 1, function(x) paste0(sort(x), collapse = ""))
combos
# 1 2 3 4 5
# "BCD" "ABC" "ACD" "BCD" "ABC"
For each row of the data frame, the values in columns 1, 2, and 3 (as labeled in the post) are sorted using sort, then concatenated using paste0. Since order doesn't matter, this ensures that identical cases are labeled consistently.
Note that the paste0 function is equivalent to paste(..., sep = ""). The collapse argument says to concatenate the values of a vector into a single string, with vector values separated by the value passed to collapse. In this case, we're setting collapse = "", which means there will be no separation between values, resulting in "ABC", "ACD", etc.
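As a minimal illustration of why sorting first makes the label order-independent, take the first three values of rows 1 and 3 from the table in the question:

```r
# First three ranks of rows 1 and 3 from the question's table
row1 <- c("D", "B", "A")
row3 <- c("A", "B", "D")

# Sort, then glue the pieces together with no separator
key1 <- paste0(sort(row1), collapse = "")
key3 <- paste0(sort(row3), collapse = "")

key1
# [1] "ABD"
key1 == key3
# [1] TRUE

# Without sorting, the two rows would get different labels
paste0(row1, collapse = "") == paste0(row3, collapse = "")
# [1] FALSE
```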
Then you can get the count of each combination using table:
table(combos)
# ABC ACD BCD
# 2 1 2
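For the weighting part of the question, one possible sketch is to sum a weight per combination with tapply instead of counting rows with table. The wide table below is typed in from the question's output, and the Weighting values are invented purely for illustration.

```r
# Wide table from the question, plus a hypothetical Weighting column
wide <- data.frame(
  Year      = 2010:2014,
  Weighting = c(1, 2, 1, 3, 1),          # made-up weights
  `1`       = c("D", "A", "A", "D", "C"),
  `2`       = c("B", "C", "B", "A", "A"),
  `3`       = c("A", "D", "D", "C", "B"),
  check.names = FALSE,
  stringsAsFactors = FALSE
)

# Order-independent label for the first three ranks of each row
combos <- apply(wide[, c("1", "2", "3")], 1,
                function(x) paste0(sort(x), collapse = ""))

# Sum of weights per combination (table() would give unweighted counts)
weighted <- tapply(wide$Weighting, combos, sum)
weighted
# ABC ABD ACD 
#   1   2   5 
```

This avoids duplicating rows up front, and the values stay as strings throughout.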
This is the same solution as @Alex_A's but using tidyverse functions:
library(reshape2) # for dcast
library(purrr)
library(dplyr)
df <- dcast(temp_df, Year ~ Rank)
distinct(df, ID = pmap_chr(select(df, num_range("", 1:3)),
~paste0(sort(c(...)), collapse="")))

Fast melted data.table operations

I am looking for patterns for manipulating data.table objects whose structure resembles that of dataframes created with melt from the reshape2 package. I am dealing with data tables with millions of rows. Performance is critical.
The generalized form of the question is whether there is a way to perform grouping based on a subset of values in a column and have the result of the grouping operation create one or more new columns.
A specific form of the question could be how to use data.table to accomplish the equivalent of what dcast does in the following:
input <- data.table(
id=c(1, 1, 1, 2, 2, 2, 3, 3, 3, 3),
variable=c('x', 'y', 'y', 'x', 'y', 'y', 'x', 'x', 'y', 'other'),
value=c(1, 2, 3, 4, 5, 6, 7, 8, 9, 10))
dcast(input,
id ~ variable, sum,
subset=.(variable %in% c('x', 'y')))
the output of which is
id x y
1 1 1 5
2 2 4 11
3 3 15 9
Quick untested answer: seems like you're looking for by-without-by, a.k.a. grouping-by-i :
setkey(input,variable)
input[c("x","y"),sum(value)]
This is like a fast HAVING in SQL. j gets evaluated for each row of i. In other words, the above gives the same result as, but is much faster than:
input[,sum(value),keyby=variable][c("x","y")]
The latter subsets and evals for all the groups (wastefully) before selecting only the groups of interest. The former (by-without-by) goes straight to the subset of groups only.
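One caveat for anyone running this on a current data.table: as far as I know, since version 1.9.4 the per-row-of-i evaluation is no longer implicit and has to be requested with by = .EACHI. A minimal sketch under that assumption, using the input table from the question (the column name total is my own):

```r
library(data.table)

input <- data.table(
  id       = c(1, 1, 1, 2, 2, 2, 3, 3, 3, 3),
  variable = c("x", "y", "y", "x", "y", "y", "x", "x", "y", "other"),
  value    = c(1, 2, 3, 4, 5, 6, 7, 8, 9, 10)
)
setkey(input, variable)

# Explicit grouping by each row of i: one result row for "x", one for "y"
per_group <- input[c("x", "y"), .(total = sum(value)), by = .EACHI]
per_group$total
# [1] 20 25

# Without by = .EACHI, j runs once over all matched rows
overall <- input[c("x", "y"), sum(value)]
overall
# [1] 45
```

The rest of the reasoning in the answer (go straight to the groups of interest rather than aggregating everything first) is unchanged; only the spelling differs.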
The group results will be returned in long format, as always. But reshaping to wide afterwards on the (relatively small) aggregated data should be relatively instant. That's the thinking anyway.
The first setkey(input,variable) might bite if input has a lot of columns not of interest. If so, it might be worth subsetting the columns needed :
DT = setkey(input[ , c("variable","value")], variable)
DT[c("x","y"),sum(value)]
In future when secondary keys are implemented that would be easier :
set2key(input,variable) # add a secondary key
input[c("x","y"),sum(value),key=2] # syntax speculative
To group by id as well :
setkey(input,variable)
input[c("x","y"),sum(value),by='variable,id']
and including id in the key might be worth setkey's cost depending on your data :
setkey(input,variable,id)
input[c("x","y"),sum(value),by='variable,id']
If you combine a by-without-by with by, as above, then the by-without-by then operates just like a subset; i.e., j is only run for each row of i when by is missing (hence the name by-without-by). So you need to include variable, again, in the by as shown above.
Alternatively, the following should group by id over the union of "x" and "y" instead (but the above is what you asked for in the question, iiuc) :
input[c("x","y"),sum(value),by=id]
> setkey(input, "id")
> input[ , list(sum(value)), by=id]
id V1
1: 1 6
2: 2 15
3: 3 34
> input[ variable %in% c("x", "y"), list(sum(value)), by=id]
id V1
1: 1 6
2: 2 15
3: 3 24
The last one:
> input[ variable %in% c("x", "y"), list(sum(value)), by=list(id, variable)]
id variable V1
1: 1 x 1
2: 1 y 5
3: 2 x 4
4: 2 y 11
5: 3 x 15
6: 3 y 9
I'm not sure if this is the best way, but you can try:
input[, list(x = sum(value[variable == "x"]),
y = sum(value[variable == "y"])), by = "id"]
# id x y
# 1: 1 1 5
# 2: 2 4 11
# 3: 3 15 9
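If hard-coding one expression per variable becomes unwieldy, another option is to filter first and then let data.table's own dcast method do the reshaping. This is only a sketch, assuming a data.table version that ships dcast for data.table objects; I have not benchmarked it against the approaches above.

```r
library(data.table)

input <- data.table(
  id       = c(1, 1, 1, 2, 2, 2, 3, 3, 3, 3),
  variable = c("x", "y", "y", "x", "y", "y", "x", "x", "y", "other"),
  value    = c(1, 2, 3, 4, 5, 6, 7, 8, 9, 10)
)

# Keep only the variables of interest, then cast to wide with sum
wide <- dcast(
  input[variable %in% c("x", "y")],
  id ~ variable,
  value.var     = "value",
  fun.aggregate = sum
)
wide
#    id  x  y
# 1:  1  1  5
# 2:  2  4 11
# 3:  3 15  9
```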

R: merging copies of the same variable

I have data like this in R:
subjID = c(1,2,3,4)
var1 = c(3,8,NA,6)
var1.copy = c(NA,NA,5,NA)
fake = data.frame(subjID = subjID, var1 = var1, var1 = var1.copy)
which looks like this:
> fake
subjID var1 var1.1
1 1 3 NA
2 2 8 NA
3 3 NA 5
4 4 6 NA
var1 and var1.1 represent the same variable, so each subject has NA in one column and a numerical value in the other (no one has two NAs or two numbers). I want to merge the columns to get a single var1: (3, 8, 5, 6).
Any tips on how to do this?
If you're only dealing with two columns, and there are never two numbers or two NAs, you can calculate the row mean and ignore missing values:
fake$fixed <- rowMeans(fake[, c("var1", "var1.1")], na.rm=TRUE)
You can use is.na, which is vectorised:
# get all the ones we can from var1
var.merged = var1;
# which ones are available in var1.copy but not in var1?
ind = is.na(var1) & !is.na(var1.copy);
# use those to fill in the blanks
var.merged[ind] = var1.copy[ind];
It depends on how you want to merge if there are conflicts.
You could simply put all non-NA values in var1.1 into the corresponding slot of var1. In case of conflicts, this will favour var1.1.
var1[!is.na(var1.copy)] <- var1.copy[!is.na(var1.copy)]
Or you could just fill in all NA values in var1 with the corresponding values of var1.1. In case of conflict, this will favour var1.
var1[is.na(var1)] <- var1.copy[is.na(var1)]
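To make the difference between the two strategies visible, here is a small sketch using ifelse on copies of the vectors, with one conflicting value added on purpose (the 7 in the last position is not in the original data).

```r
var1      <- c(3, 8, NA, 6)
var1.copy <- c(NA, NA, 5, 7)   # last value is a made-up conflict

# Favour var1: only use the copy where var1 is missing
favour_var1 <- ifelse(is.na(var1), var1.copy, var1)
favour_var1
# [1] 3 8 5 6

# Favour the copy: only use var1 where the copy is missing
favour_copy <- ifelse(is.na(var1.copy), var1, var1.copy)
favour_copy
# [1] 3 8 5 7
```

With the original data, where no subject has two numbers, both versions return 3 8 5 6.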

Creating new variable from three existing variables in R

I have a dataset that looks like the one below, and I would like to create a new variable based on these variables, which can be used with the other variables in the dataset.
The first variable, ID, is a respondent identification number. The med variable takes the values 1 and 2, indicating different treatments. Var1_v1 and Var1_v2 have four real options (1, 2, 3, or 9), and these options are only given to respondents with med == 1. If med == 2, NA appears in the Var1 columns. Var2 is NA when med == 1 and has real values ranging from 1 to 3 when med == 2.
ID <- c(1,2,3,4,5,6,7,8,9,10,11)
med <- c(1,1,1,1,1,1,2,2,2,2,2)
Var1_v1 <- c(2,2,3,9,9,9,NA,NA,NA,NA,NA) #ranges from 1-3, and 9
Var1_v2 <- c(9,9,9,1,3,2,NA,NA,NA,NA,NA) #ranges from 1-3, and 9
Var2 <- c(NA,NA,NA,NA,NA,NA,3,3,1,3,2)
#tables to show you what data looks like relative to med var
table(Var1_v1, med)
table(Var1_v2, med)
table(Var2, med)
I've been looking around for a while to figure out a recoding/new variable creation code, but I have had no luck.
Ultimately, I would like to create a new variable, say Var3, based on three conditions:
Uses the values from Var1_v1 if the value = 1, 2, or 3
Uses the values from Var1_v2 if the value = 1, 2, or 3
uses the values from Var2 if the values = 1, 2, or 3
And this variable should be able to match up with the ID number, so that it can be used within the dataset.
So the final variable should look like:
Var3 <- c(2,2,3,1,3,2,3,3,1,3,2)
Thanks!
Something like
v <- Var1_v1
v[Var1_v2 %in% 1:3] <- Var1_v2[Var1_v2 %in% 1:3]
v[Var2 %in% 1:3] <- Var2[Var2 %in% 1:3]
v
[1] 2 2 3 1 3 2 3 3 1 3 2
which uses one of them as a base (you could also use a pure NA vector) and simply fills in only parts that match.
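The pure-NA variant mentioned above could look like the sketch below. Starting from NA means a 9 can never leak through from the base vector, and putting the result in a data frame (here called dat, my own name) keeps it lined up with ID.

```r
ID      <- c(1, 2, 3, 4, 5, 6, 7, 8, 9, 10, 11)
med     <- c(1, 1, 1, 1, 1, 1, 2, 2, 2, 2, 2)
Var1_v1 <- c(2, 2, 3, 9, 9, 9, NA, NA, NA, NA, NA)
Var1_v2 <- c(9, 9, 9, 1, 3, 2, NA, NA, NA, NA, NA)
Var2    <- c(NA, NA, NA, NA, NA, NA, 3, 3, 1, 3, 2)

# Start from all-NA and copy in only the values that are 1, 2 or 3
Var3 <- rep(NA_real_, length(ID))
for (src in list(Var1_v1, Var1_v2, Var2)) {
  ok <- src %in% 1:3        # NA %in% 1:3 is FALSE, so NAs are skipped
  Var3[ok] <- src[ok]
}

dat <- data.frame(ID, med, Var3)
dat$Var3
# [1] 2 2 3 1 3 2 3 3 1 3 2
```

If two sources both held a valid value for the same respondent, the later one in the list would win, so the order of the list is a choice.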
