Changing the values in many R variables - r

I want to do the equivalent of find and replace 1=0;2=0;3=0;4=1;5=2;6=3 for many different variables in my data set.
Things I've tried:
Making 1=0;2=0;3=0;4=1;5=2;6=3 into a function and using sapply. I changed the ; to , and the = to <-, but no combination of these was recognized as a function. I also tried creating a function with that definition and passing it to sapply, and it didn't work.
I tried using recode and it did not work:
wdata[ ,cols2] = recode(wdata[ ,cols2], 1=0;2=0;3=0;4=1;5=2;6=3)

Assuming you are working with a data.frame or matrix you can use direct indexing:
# Sample data
set.seed(2017);
df <- as.data.frame(matrix(sample(1:6, 20, replace = T), ncol = 4));
df;
#V1 V2 V3 V4
#1 6 5 5 3
#2 4 1 1 3
#3 3 3 1 5
#4 2 3 3 6
#5 5 2 3 5
df[df == 1 | df == 2 | df == 3] <- 0;
df[df == 4] <- 1;
df[df == 5] <- 2;
df[df == 6] <- 3;
df;
# V1 V2 V3 V4
#1 3 2 2 0
#2 1 0 0 0
#3 0 0 0 2
#4 0 0 0 3
#5 2 0 0 2
Note that the order of the substitutions matters. For example, df[df == 4] <- 1; df[df == 1] <- 0; gives a different result from df[df == 1] <- 0; df[df == 4] <- 1; because in the first sequence the original 4s become 1s and are then overwritten with 0.
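The order dependence is easy to see on a two-element vector. This is a minimal sketch; the vector x and the copies a and b are made up for illustration.

```r
x <- c(1, 4)

# Recode 4 -> 1 first, then 1 -> 0: the former 4 is overwritten a second time
a <- x
a[a == 4] <- 1
a[a == 1] <- 0
a
# [1] 0 0

# Recode 1 -> 0 first, then 4 -> 1: each value is changed only once
b <- x
b[b == 1] <- 0
b[b == 4] <- 1
b
# [1] 0 1
```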

Alternative solution using recode from dplyr with sapply or mutate_all:
set.seed(2017);
df <- as.data.frame(matrix(sample(1:6, 20, replace = T), ncol = 4));
df
library(dplyr)
f = function(x) recode(x, `1`=0, `2`=0, `3`=0, `4`=1, `5`=2, `6`=3)
sapply(df, f)
# V1 V2 V3 V4
# [1,] 3 2 2 0
# [2,] 1 0 0 0
# [3,] 0 0 0 2
# [4,] 0 0 0 3
# [5,] 2 0 0 2
df %>% mutate_all(f)
# V1 V2 V3 V4
# 1 3 2 2 0
# 2 1 0 0 0
# 3 0 0 0 2
# 4 0 0 0 3
# 5 2 0 0 2
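In dplyr 1.0.0 and later, mutate_all is superseded by across. A hedged sketch of the same recoding with across follows; the small data frame and the column subset cols2 (named after the question's variable) are made up here, not taken from the asker's data.

```r
library(dplyr)

# Illustrative data, not the asker's
df <- data.frame(V1 = c(6, 4, 3), V2 = c(5, 1, 3), V3 = c(2, 2, 6))

f <- function(x) recode(x, `1` = 0, `2` = 0, `3` = 0, `4` = 1, `5` = 2, `6` = 3)

# Recode every column
all_cols <- df %>% mutate(across(everything(), f))

# Recode only a chosen subset of columns (cols2 is a hypothetical selection)
cols2 <- c("V1", "V3")
some_cols <- df %>% mutate(across(all_of(cols2), f))
```

Columns not listed in cols2 are passed through unchanged, which matches what the question asks for with wdata[ ,cols2].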

A looping alternative with lapply and match is as follows:
dat[] <- lapply(dat, function(x) c(0, 0, 0, 1, 2, 3)[match(x, 1:6)])
This uses a lookup table on the vector c(0,0,0,1,2,3) with match selecting the indices. Using the data.frame created by Maurits Evers, we get
dat
V1 V2 V3 V4
1 3 2 2 0
2 1 0 0 0
3 0 0 0 2
4 0 0 0 3
5 2 0 0 2
To do this for a subset of the columns, just select them on each side, like
dat[, cols2] <-
lapply(dat[, cols2], function(x) c(0, 0, 0, 1, 2, 3)[match(x, 1:6)])
or
dat[cols2] <- lapply(dat[cols2], function(x) c(0, 0, 0, 1, 2, 3)[match(x, 1:6)])
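To see what the lookup does, it may help to run it on a plain vector first. In this sketch the names lookup, x, idx and recoded are chosen for illustration, and the value 7 is added to show what happens to input outside 1:6.

```r
lookup <- c(0, 0, 0, 1, 2, 3)   # new values for old values 1..6, in order
x <- c(6, 4, 1, 3, 7)           # 7 is not one of the expected old values

idx <- match(x, 1:6)            # position of each old value within 1:6
idx
# [1]  6  4  1  3 NA

recoded <- lookup[idx]
recoded
# [1]  3  1  0  0 NA
```

Because every value is translated in a single step, the order problem of sequential replacement does not arise. Values that are not in 1:6 come back as NA rather than being left unchanged.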

Related

Add X number of columns to a data.frame

I would like to add a varying number (X) of columns with 0 to an existing data.frame within a function.
Here is an example data.frame:
dt <- data.frame(x=1:3, y=4:6)
I would like to get this result if X=1 :
a x y
1 0 1 4
2 0 2 5
3 0 3 6
And this if X=3 :
a b c x y
1 0 0 0 1 4
2 0 0 0 2 5
3 0 0 0 3 6
What would be an efficient way to do this?
We can assign multiple columns to '0' based on the value of 'X'
X <- 3
nm1 <- names(dt)
dt[letters[seq_len(X)]] <- 0
dt[c(setdiff(names(dt), nm1), nm1)]
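Since the question asks for this inside a function, the same steps could be wrapped as follows. This is only a sketch: the name add_zero_cols is made up, and it assumes X is at most 26 and that none of the new letter names already exist in the data.

```r
add_zero_cols <- function(d, X) {
  new_names <- letters[seq_len(X)]      # assumes X <= 26 and no name clashes
  d[new_names] <- 0                     # 0 is recycled down every new column
  d[c(new_names, setdiff(names(d), new_names))]  # new columns first
}

dt <- data.frame(x = 1:3, y = 4:6)
res <- add_zero_cols(dt, 3)
names(res)
# [1] "a" "b" "c" "x" "y"
```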
Also, we can use add_column from tibble and create columns at a specific location
library(tibble)
add_column(dt, .before = 1, !!!setNames(as.list(rep(0, X)),
letters[seq_len(X)]))
A second option is cbind
f <- function(x, n = 3) {
cbind.data.frame(matrix(
0,
ncol = n,
nrow = nrow(x),
dimnames = list(NULL, letters[1:n])
), x)
}
f(dt, 5)
# a b c d e x y
#1 0 0 0 0 0 1 4
#2 0 0 0 0 0 2 5
#3 0 0 0 0 0 3 6
NOTE: because letters has a length of 26 the function would need some adjustment regarding the naming scheme if n > 26.
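One possible adjustment for n > 26 is to number the new columns instead of lettering them. The function f2 and its prefix argument below are hypothetical, shown only to illustrate the idea.

```r
f2 <- function(x, n = 3, prefix = "new") {
  # names such as new1, new2, ... are not limited to 26 columns
  zeros <- matrix(0, nrow = nrow(x), ncol = n,
                  dimnames = list(NULL, paste0(prefix, seq_len(n))))
  cbind.data.frame(zeros, x)
}

dt <- data.frame(x = 1:3, y = 4:6)
wide <- f2(dt, 30)
dim(wide)
# [1]  3 32
```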
You can try the code below
dt <- cbind(`colnames<-`(t(rep(0,X)),letters[seq(X)]),dt)
If you don't care about the names of the added columns, you can use just
dt <- cbind(t(rep(0,X)),dt)
which is much shorter

How to flag rows that have values larger than a threshold in any of multiple columns in R

I have the following data.table
library(data.table)
dt <- data.table(V1=c(1,3,1,0,NA,0),
V2=c(1,0,1,0,1,3),
Q1=c(3,5,10,14,0,3),
Q2=c(0,1,8,NA,0,NA))
I want to add a new column that has value 1 if any of the columns V1, V2 has a value larger than 2 and any of the columns Q1, Q2 has a value larger than 0 (and 0 otherwise).
In the end I want to end up with something like this:
> dt
V1 V2 Q1 Q2 new
1: 1 1 3 0 0
2: 3 0 5 1 1
3: 1 1 10 8 0
4: 0 0 14 NA 0
5: NA 1 0 0 0
6: 0 3 3 NA 1
EDIT
In principle I would like to have 2 vectors of column names, so something like v_columms <- names(dt)[names(dt) %like% "V"] and q_columms <- names(dt)[names(dt) %like% "Q"], and use these.
We can use melt to process multiple columns by specifying the patterns in measure to convert it to 'long' format and then apply the condition
dt[, new := melt(dt, measure = patterns("V", "Q"))[,
+(any(value1 > 2) & any(value2 > 0)),rowid(variable)]$V1]
dt
# V1 V2 Q1 Q2 new
#1: 1 1 3 0 0
#2: 3 0 5 1 1
#3: 1 1 10 8 0
#4: 0 0 14 NA 0
#5: NA 1 0 0 0
#6: 0 3 3 NA 1
Or without melt, if there are only two groups of columns, then
vs <- grep("V", names(dt))
qs <- grep("Q", names(dt))
dt[, new := +(Reduce(`|`, lapply(.SD[, ..vs], `>`, 2)) &
Reduce(`|`, lapply(.SD[, ..qs], `>`, 0)))]
Using dplyr and either case_when or if_else:
library(dplyr)
dt %>%
  mutate(new = case_when((V1 > 2 | V2 > 2) & (Q1 > 0 | Q2 > 0) ~ 1,
                         TRUE ~ 0))
dt %>%
mutate(new = if_else((V1 > 2 | V2 > 2) & (Q1 > 0 | Q2 > 0), 1 , 0))
V1 V2 Q1 Q2 new
1 1 1 3 0 0
2 3 0 5 1 1
3 1 1 10 8 0
4 0 0 14 NA 0
5 NA 1 0 0 0
6 0 3 3 NA 1
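If the columns should be picked by name pattern rather than typed out, if_any can stand in for the chained | conditions. This is a sketch that assumes dplyr 1.0.4 or later (where if_any was introduced); the data frame d is a plain data.frame copy of the question's data.

```r
library(dplyr)

d <- data.frame(V1 = c(1, 3, 1, 0, NA, 0),
                V2 = c(1, 0, 1, 0, 1, 3),
                Q1 = c(3, 5, 10, 14, 0, 3),
                Q2 = c(0, 1, 8, NA, 0, NA))

# TRUE when any V column exceeds 2 and any Q column exceeds 0
res <- d %>%
  mutate(new = as.integer(if_any(starts_with("V"), ~ .x > 2) &
                          if_any(starts_with("Q"), ~ .x > 0)))
res$new
# [1] 0 1 0 0 0 1
```

With other data an NA in one group can make the combined condition NA, so an explicit rule for missing values may be needed.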
Here's another approach with some helper functions:
foo <- function(.dt, cols, vals, na.rm = TRUE) {
rowSums(.dt[, cols, with=FALSE] > vals, na.rm = na.rm) > 0
}
bar <- function(.dt, cols_list, vals_list) {
as.integer(Reduce("&", Map(function(cols, vals) foo(.dt, cols, vals), cols_list, vals_list)))
}
dt[, new := bar(.SD, list(v_columms, q_columms), list(2, 0))]
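The helper functions above boil down to two rowSums checks, one per group of columns. A flat sketch of that idea follows; the vectors v_cols and q_cols are names chosen here and are built with grep, which is an assumption about how the column groups are identified.

```r
library(data.table)

dt <- data.table(V1 = c(1, 3, 1, 0, NA, 0),
                 V2 = c(1, 0, 1, 0, 1, 3),
                 Q1 = c(3, 5, 10, 14, 0, 3),
                 Q2 = c(0, 1, 8, NA, 0, NA))

v_cols <- grep("^V", names(dt), value = TRUE)
q_cols <- grep("^Q", names(dt), value = TRUE)

# TRUE where at least one column in the group exceeds its threshold;
# na.rm = TRUE means an NA never counts as exceeding
v_hit <- rowSums(dt[, ..v_cols] > 2, na.rm = TRUE) > 0
q_hit <- rowSums(dt[, ..q_cols] > 0, na.rm = TRUE) > 0

dt[, new := as.integer(v_hit & q_hit)]
dt$new
# [1] 0 1 0 0 0 1
```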

combine tables into a data frame

How do I turn a list of tables into a data frame?
I have:
> (tabs <- list(table(c('a','a','b')),table(c('c','c','b')),table(c()),table(c('b','b'))))
[[1]]
a b
2 1
[[2]]
b c
1 2
[[3]]
< table of extent 0 >
[[4]]
b
2
I want:
> data.frame(a=c(2,0,0,0),b=c(1,1,0,2),c=c(0,2,0,0))
a b c
1 2 1 0
2 0 1 2
3 0 0 0
4 0 2 0
PS. Please do not assume that the tables were created by table calls! They were not!
c_names <- unique(unlist(sapply(tabs, names)))
df <- do.call(rbind, lapply(tabs, `[`, c_names))
colnames(df) <- c_names
df[is.na(df)] <- 0
This assumes the tables are one dimensional.
all.names <- unique(unlist(lapply(tabs, names)))
df <- as.data.frame(do.call(rbind,
lapply(
tabs, function(x) as.list(replace(c(x)[all.names], is.na(c(x)[all.names]), 0))
) ) )
names(df) <- all.names
df
There is probably a cleaner way to do this.
# a b c
# 1 2 1 0
# 2 0 1 2
# 3 0 0 0
# 4 0 2 0
tabs <- list(table(c('a','a','b')),table(c('c','c','b')),table(c()),table(c('b','b')))
dat.names <- unique(unlist(sapply(tabs, names)))
dat <- matrix(0, nrow = length(tabs), ncol = length(dat.names))
colnames(dat) <- dat.names
for (ii in 1:length(tabs)) {
dat[ii, ] <- tabs[[ii]][match(colnames(dat), names(tabs[[ii]]) )]
}
dat[is.na(dat)] <- 0
> dat
a b c
[1,] 2 1 0
[2,] 0 1 2
[3,] 0 0 0
[4,] 0 2 0
Here is a pretty clean approach:
library(reshape2)
newTabs <- melt(tabs)
newTabs
# Var1 value L1
# 1 a 2 1
# 2 b 1 1
# 3 b 1 2
# 4 c 2 2
# 5 b 2 4
newTabs$L1 <- factor(newTabs$L1, seq_along(tabs))
dcast(newTabs, L1 ~ Var1, fill = 0, drop = FALSE)
# L1 a b c
# 1 1 2 1 0
# 2 2 0 1 2
# 3 3 0 0 0
# 4 4 0 2 0
This makes use of the fact that there is a melt method for lists (see reshape2:::melt.list) which automatically adds in a variable (L1 for an unnested list) that identifies the index of the list element. Since your list has some items which are empty, they won't show up in your melted list, so you need to factor the "L1" column, specifying the levels you want. dcast takes care of restructuring your output and allows you to specify the desired fill value.
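The effect of setting the factor levels explicitly can be checked in isolation. In this sketch the vector L1 imitates the melted index column, in which list element 3 is missing because its table was empty.

```r
L1 <- c(1, 1, 2, 2, 4)        # list indices that survive melting; 3 is missing

table(L1)                     # only 1, 2 and 4 appear

# Declaring all four levels keeps the empty element as a zero count
counts <- table(factor(L1, levels = 1:4))
counts
# 1 2 3 4
# 2 2 0 2
```

dcast with drop = FALSE relies on the same mechanism to keep a row for the empty list element.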

multiply multiple column and find sum of each column for multiple values

I'm trying to multiply columns together and keep track of their names.
I have a data frame:
v1 v2 v3 v4 v5
0 1 1 1 1
0 1 1 0 1
1 0 1 1 0
I'm trying to multiply each column with the others, like:
v1v2
v1v3
v1v4
v1v5
and
v2v3
v2v4
v2v5
etc, and
v1v2v3
v1v2v4
v1v2v5
v2v3v4
v2v3v5
and so on for combinations of 4 and 5 columns; if there are n columns, then up to combinations of n.
I tried the following code in a while loop:
i<-1
while(i<=ncol(data)
{
results<-data.frame()
v<-i
results<- t(apply(data,1,function(x) combn(x,v,prod)))
comb <- combn(colnames(data),v)
colnames(results) <- apply(comb,v,function(x) paste(x[1],x[2],sep="*"))
results <- colSums(results)
}
but it is not working.
Sample output for n=3:
v1v2 v1v3 v2v3
0 0 1
0 0 1
0 1 0
and colsum
v1v2 v1v3 v2v3
0 1 2
then
v1v2=0
v1v3=1
v2v3=2
This is the result I'm trying to get.
Try this:
df <- read.table(text = "v1 v2 v3 v4 v5
0 1 1 1 1
0 1 1 0 1
1 0 1 1 0", skip = 1)
df
# For each combination size from 2 to ncol(df), multiply the columns of
# every combination row by row, then sum each product over the rows.
ll <- lapply(2:ncol(df), function(ncols){
  tmp <- t(apply(df, 1, function(rows) combn(x = rows, m = ncols, prod)))
  if(ncols < ncol(df)){
    tmp <- colSums(tmp)
  }
  else{
    # only one combination: tmp is a single-row matrix, so sum all of it
    tmp <- sum(tmp)
  }
  names1 <- t(combn(x = colnames(df), m = ncols))
  names(tmp) <- apply(names1, 1, function(rows) paste0(rows, collapse = ""))
  tmp
})
ll
# [[1]]
# V1V2 V1V3 V1V4 V1V5 V2V3 V2V4 V2V5 V3V4 V3V5 V4V5
# 0 1 1 0 2 1 2 2 2 1
#
# [[2]]
# V1V2V3 V1V2V4 V1V2V5 V1V3V4 V1V3V5 V1V4V5 V2V3V4 V2V3V5 V2V4V5 V3V4V5
# 0 0 0 1 0 0 1 2 1 1
#
# [[3]]
# V1V2V3V4 V1V2V3V5 V1V2V4V5 V1V3V4V5 V2V3V4V5
# 0 0 0 0 1
#
# [[4]]
# V1V2V3V4V5
# 0
Edit following comment
The results for the different sets of column combinations can then be accessed by indexing (subsetting) the list. E.g. to access the "2 combinations", select the first element of the list; to access the "3 combinations", select the second element of the list, etc.
ll[[1]]
# V1V2 V1V3 V1V4 V1V5 V2V3 V2V4 V2V5 V3V4 V3V5 V4V5
# 0 1 1 0 2 1 2 2 2 1
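The core of the approach is combn with a function argument, which applies that function to every combination. A small sketch on the question's n=3 example follows; the matrix m and the name prods are chosen here for illustration.

```r
m <- rbind(c(0, 1, 1),
           c(0, 1, 1),
           c(1, 0, 1))
colnames(m) <- c("v1", "v2", "v3")

# One row per original row, one column per pair of columns
prods <- t(apply(m, 1, function(r) combn(r, 2, prod)))

# Build the matching names by pasting each pair of column names
colnames(prods) <- combn(colnames(m), 2, paste, collapse = "")

res <- colSums(prods)
res
# v1v2 v1v3 v2v3
#    0    1    2
```

This reproduces the column sums given in the question for n=3.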

Aggregating every 10 columns in a binary matrix

I am new to R.
I would like to transform a binary matrix like this:
example:
" 1874 1875 1876 1877 1878 .... 2009
F 1 0 0 0 0 ... 0
E 1 1 0 0 0 ... 0
D 1 1 0 0 0 ... 0
C 1 1 0 0 0 ... 0
B 1 1 0 0 0 ... 0
A 1 1 0 0 0 ... 0"
Since the column names are years, I would like to aggregate them into decades and obtain something like:
"1840-1849 1850-1859 1860-1869 .... 2000-2009
F 1 0 0 0 0 ... 0
E 1 1 0 0 0 ... 0
D 1 1 0 0 0 ... 0
C 1 1 0 0 0 ... 0
B 1 1 0 0 0 ... 0
A 1 1 0 0 0 ... 0"
I am used to python and do not know how to do this transformation without making loops!
Thanks, isabel
It is unclear what aggregation you want, but using the following dummy data
set.seed(42)
df <- data.frame(matrix(sample(0:1, 6*25, replace = TRUE), ncol = 25))
names(df) <- 1874 + 0:24
The following counts events in each 10-year period.
Get the years as a numeric variable
years <- as.numeric(names(df))
Next we need an indicator for the start of each decade
# round down to the decade of the first year and up past the last year
ind <- seq(from = floor(years[1] / 10) * 10, to = ceiling((tail(years, 1) + 1) / 10) * 10, by = 10)
We then apply over the indices of ind (1:(length(ind)-1)), select columns from df that are the current decade and count the 1s using rowSums.
tmp <- lapply(seq_along(ind[-1]),
function(i, inds, data) {
rowSums(data[, names(data) %in% inds[i]:(inds[i+1]-1)])
}, inds = ind, data = df)
Next we cbind the resulting vectors into a data frame and fix-up the column names:
out <- do.call(cbind.data.frame, tmp)
names(out) <- paste(head(ind, -1), tail(ind, -1) - 1, sep = "-")
out
This gives:
> out
1870-1879 1880-1889 1890-1899
1 4 5 6
2 4 6 6
3 2 5 5
4 5 5 7
5 3 3 7
6 5 5 4
If you want simply a binary matrix with a 1 indicating at least 1 event happened in that decade, then you can use:
tmp2 <- lapply(seq_along(ind[-1]),
function(i, inds, data) {
as.numeric(rowSums(data[, names(data) %in% inds[i]:(inds[i+1]-1)]) > 0)
}, inds = ind, data = df)
out2 <- do.call(cbind.data.frame, tmp2)
names(out2) <- paste(head(ind, -1), tail(ind, -1) - 1, sep = "-")
out2
which gives:
> out2
1870-1879 1880-1889 1890-1899
1 1 1 1
2 1 1 1
3 1 1 1
4 1 1 1
5 1 1 1
6 1 1 1
If you want a different aggregation, then modify the function applied in the lapply call to use something other than rowSums.
This is another option, using modular arithmetic to aggregate the columns.
# setup, borrowed from @GavinSimpson
set.seed(42)
df <- data.frame(matrix(sample(0:1, 6*25, replace = TRUE), ncol = 25))
names(df) <- 1874 + 0:24
result <- do.call(cbind,
by(t(df), as.numeric(names(df)) %/% 10 * 10, colSums))
# add -xxx9 to column names, for each decade
dimnames(result)[[2]] <- paste(colnames(result), as.numeric(colnames(result)) + 9, sep='-')
# 1870-1879 1880-1889 1890-1899
# V1 4 5 6
# V2 4 6 6
# V3 2 5 5
# V4 5 5 7
# V5 3 3 7
# V6 5 5 4
If you want to aggregate with something other than sum, replace the call to colSums with something like function(cols) sapply(cols, f), where f is the aggregating function, e.g., max.
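The integer-division grouping can also be written with split and sapply, which keeps the data as a matrix. The sketch below uses a small hand-made matrix m (not the random data above) so that the result can be checked by eye; the names groups, counts and binary are chosen here.

```r
# Two rows, five years spanning two decades
m <- matrix(c(1, 0,  1, 1,  0, 0,  0, 1,  1, 1), nrow = 2,
            dimnames = list(c("A", "B"), as.character(1878:1882)))

years  <- as.numeric(colnames(m))
decade <- years %/% 10 * 10                 # 1870 1870 1880 1880 1880
groups <- split(seq_along(years), decade)   # column positions per decade

# Count the events per row within each decade
counts <- sapply(groups, function(j) rowSums(m[, j, drop = FALSE]))
colnames(counts) <- paste(names(groups), as.numeric(names(groups)) + 9, sep = "-")
counts
#   1870-1879 1880-1889
# A         2         1
# B         1         2

# 1 if at least one event happened in the decade, else 0
binary <- (counts > 0) * 1
```

The drop = FALSE keeps the subset a matrix even when a decade contains a single year, so rowSums still works.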

Resources