How to modify some but not all variables of a data frame? - r

Suppose there is a data.frame where some variables are coded as integers:
a <- c(1,2,3,4,5)
b <- as.integer(c(2,3,4,5,6))
c <- as.integer(c(5,1,0,9,2))
d <- as.integer(c(5,6,7,3,1))
e <- c(2,6,1,2,3)
df <- data.frame(a,b,c,d,e)
str(df)
Suppose I want to convert columns b to d to numeric:
varlist <- names(df)[2:4]
lapply(varlist, function(x) {
df$x <- as.numeric(x, data=x)
})
str(df)
does not work.
I tried:
df$b <- as.numeric(b, data=df)
df$c <- as.numeric(c, data=df)
df$d <- as.numeric(d, data=df)
str(df)
which works fine.
Questions:
How do I do this (in a loop or, better, with lapply; I'm a Stata person and as such used to writing loops)?
And more generally: how do I apply any function to a list of variables in a data.frame
(e.g. multiply each variable on the list by some other variable [which always stays the same,
BONUS: or changes with each variable on the list])?

For the first question you can use sapply:
df[2:4] <- sapply(df[2:4],as.numeric)
For the second you can use mapply. For example, to multiply the three variables (columns 2 to 4) by three different random scalars:
df[2:4] <- mapply(function(x,y)df[[x]]*y,2:4,rnorm(3))
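The lapply attempt in the question most likely fails for two reasons: inside the function, df$x refers to a column literally named "x" rather than the column whose name is stored in x, and the assignment only changes a local copy of df. A minimal sketch of a list-based alternative follows; the multipliers vector is a made-up example, not part of the original question.

```r
df <- data.frame(a = c(1, 2, 3, 4, 5),
                 b = as.integer(c(2, 3, 4, 5, 6)),
                 c = as.integer(c(5, 1, 0, 9, 2)),
                 d = as.integer(c(5, 6, 7, 3, 1)),
                 e = c(2, 6, 1, 2, 3))
varlist <- names(df)[2:4]

# lapply returns one converted vector per column; assigning the list back
# to df[varlist] replaces exactly those columns and leaves the rest alone
df[varlist] <- lapply(df[varlist], as.numeric)

# a different (hypothetical) multiplier for each column on the list
multipliers <- c(10, 100, 1000)
scaled <- df
scaled[varlist] <- Map(function(col, k) col * k, df[varlist], multipliers)
str(scaled)
```

Assigning a list to df[varlist] keeps each column's own type, whereas sapply first builds a matrix, which forces all selected columns to a common type.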

df[,2:4] <- sapply(df[,2:4], as.numeric)
As for your second question, if you want to, say, multiply column c by 5:
df$c <- df$c * 5
Or multiply by any vector of the same length as c, for example to create a new column that is c times d:
df$cd <- df$c * df$d

Related

Join list of matrices using xtabs inside for() loop

Say we have a set of matrices of different dimensions, but with common row and column names. We would like to find the element-wise means of the matrices. xtabs() is a convenient function for this.
However, inside of for(), as.table() fails to recognize the expression calling each matrix. Creating a list of the matrices first and then calling each element of that list fails just the same.
MWE: Create matrices:
m1 <- matrix(1:9,nrow=3,ncol=3)
colnames(m1) <- c("A","B","C")
rownames(m1) <- c("A","B","C")
m2 <- matrix(10:18,nrow=3,ncol=3)
colnames(m2) <- c("A","B","C")
rownames(m2) <- c("A","B","C")
m3 <- matrix(19:22,nrow=2,ncol=2)
colnames(m3) <- c("A","B")
rownames(m3) <- c("A","B")
Use one of the matrices as a foundation to build from:
A <- m1
Join and find means:
for(i in 2:3){
mat <- noquote(paste0("m", i))
B <- rbind(as.data.frame(as.table(A)), as.data.frame(as.table(mat)))
A <- xtabs(Freq ~ Var1 + Var2, aggregate(Freq ~ Var1 + Var2, B, mean))
}
The problem is with as.table(mat), resulting in an error:
Error in as.table.default(mat) : cannot coerce to a table
This is just a working example, the real application repeats this over thousands of matrices with different naming conventions. Inserting noquote(paste0("m", i)) directly into as.table() also fails.
Simply replacing mat with the matrix object directly works fine (i.e. as.table(m2)). Thanks!
Here, we need get() to retrieve the object whose name is stored in the string:
for(i in 2:3){
mat <- get(paste0("m", i))
B <- rbind(as.data.frame(as.table(A)), as.data.frame(as.table(mat)))
A <- xtabs(Freq ~ Var1 + Var2, aggregate(Freq ~ Var1 + Var2, B, mean))
}
A
#     Var2
#Var1     A     B     C
#   A 12.25 14.75 11.50
#   B 13.25 15.75 12.50
#   C  7.50 10.50 13.50
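Note that the loop above averages the running result with each new matrix, so later matrices carry more weight (cell A,A is (5.5 + 19) / 2 = 12.25, not (1 + 10 + 19) / 3). If an equally weighted element-wise mean is wanted instead, a sketch along the following lines may help; it assumes the matrices can be collected by name with mget(), and the names mats, long and overall are illustrative only.

```r
m1 <- matrix(1:9,   3, 3, dimnames = list(c("A", "B", "C"), c("A", "B", "C")))
m2 <- matrix(10:18, 3, 3, dimnames = list(c("A", "B", "C"), c("A", "B", "C")))
m3 <- matrix(19:22, 2, 2, dimnames = list(c("A", "B"), c("A", "B")))

# collect the matrices into a named list by their object names
mats <- mget(paste0("m", 1:3))

# turn every matrix into long format (Var1, Var2, Freq) and stack them
long <- do.call(rbind, lapply(mats, function(m) as.data.frame(as.table(m))))

# one aggregate over all matrices gives the equally weighted cell means
overall <- xtabs(Freq ~ Var1 + Var2, aggregate(Freq ~ Var1 + Var2, long, mean))
overall
```

Cells that are missing from a smaller matrix are simply averaged over the matrices that do contain them.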

Replacing all negative values from a dataset

I have a dataframe with mixed data ranging from variables(or columns) with numerical values to variables(or columns) with factors.
I would like to use the following piece of code in R to replace all negative values with NA and subsequently remove the entire variable if more than 99% of observations for that variable are NA.
The first part should make sure there is no problem when encountering strings.
Would it be possible to simply start with:
mydata$v1[mydata$v1<0] <- NA
But not specific to v1, and only if the observation is not a string?
Follow up:
This is how far I got with the explanation provided by @stas g. However, it does not seem like any variable was dropped from the df.
#mixed data
df <- data.frame(WVS_Longitudinal_1981_2014_R_v2015_04_18)
dat <- df[,sapply(df, function(x) {class(x)== "numeric" | class(x) ==
"integer"})]
foo <- function(dat, p){
ind <- colSums(is.na(dat))/nrow(dat)
dat[dat < 0] <- NA
dat[, ind < p]
}
#process numeric part of the data separately
ii <- sapply(df, class) == "numeric" | sapply(df, class) == "integer"
dat.num <- foo(as.matrix(df[, ii]), 0.99)
#then stick the two parts back together again
WVS <- data.frame(df[, !ii], dat.num)
It is impossible to know exactly how to help you without a minimal reproducible example, but assuming you have sample data like the one below:
#matrix of random normal observations, 20 samples, 5 variables
dat <- matrix(rnorm(100), nrow = 20)
#if entry is negative, replace with 'NA'
dat[dat < 0] <- NA
#threshold for dropping a variable
p <- 0.99
#check how many NAs in each column (proportionally)
ind <- colSums(is.na(dat))/nrow(dat)
#only keep columns where threshold is not exceeded
dat <- dat[, ind < p]
If you have non-numeric variables and you are dealing with a data.frame, you could do something like this (assuming you don't care about the order of columns):
#generate mixed data
dat <- matrix(rnorm(100), nrow = 20) #20 x 5 matrix of random normal numbers
df <- data.frame(letters[1 : 20], dat) #combined with one character column
foo <- function(dat, p){
dat[dat < 0] <- NA #replace negatives first, so that they count as NA below
ind <- colSums(is.na(dat))/nrow(dat) #proportion of NAs in each column
dat[, ind < p]
}
#process numeric part of the data separately
ii <- sapply(df, class) == "numeric" #ind of numeric columns
dat.num <- foo(as.matrix(df[, ii]), 0.99) #feed numeric part of data to foo
#then stick the two parts back together again
data.frame(df[, !ii], dat.num)
The solution suggested by @YOLO finally solved the issue:
cleanFun <- function(df){
# set negative values as NA
df[df < 0] <- NA
# faster, vectorized solution
# select numeric columns
num_cols <- names(df)[sapply(df, is.numeric)]
# get name of columns with 99% or more NA values
col_to_remove <- num_cols[colMeans(is.na(df[num_cols]))>=0.99]
# drop those columns
return (df[setdiff(colnames(df),col_to_remove)])
}
your_df <- cleanFun(your_df)
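As a small illustration of the same idea restricted to numeric columns (so that strings and factors are never compared with 0), here is a hedged sketch. The function name clean_negatives, the threshold argument and the toy data are assumptions for the example, not part of the original answers, and the sketch assumes the data frame has at least one numeric column.

```r
clean_negatives <- function(df, threshold = 0.99) {
  # names of the numeric (integer or double) columns
  num_cols <- names(df)[sapply(df, is.numeric)]
  # replace negative values with NA in the numeric columns only
  df[num_cols] <- lapply(df[num_cols],
                         function(x) replace(x, !is.na(x) & x < 0, NA))
  # share of NA values per numeric column
  na_share <- colMeans(is.na(df[num_cols]))
  # drop numeric columns whose NA share reaches the threshold
  drop_cols <- num_cols[na_share >= threshold]
  df[setdiff(names(df), drop_cols)]
}

toy <- data.frame(id = letters[1:4],
                  v1 = c(-1, -2, -3, -4),   # all negative, will be dropped
                  v2 = c(1, -2, 3, 4),      # one negative, will be kept
                  stringsAsFactors = FALSE)
cleaned <- clean_negatives(toy)
cleaned
```

Indexing the logical vector against num_cols rather than names(df) matters as soon as non-numeric columns are present, because the two vectors then have different lengths.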

R, apply function on every second column of a data frame?

How to apply a function on every second column of a data frame? That is to say, how to modify df2 <- sapply(df1, fun) such that df2 equals df1 but with fun applied to every second column? Here is what I tried:
a <- c(1,2,3,4,5)
b <- c(6,7,8,9,10)
df1 <- data.frame(a,b)
df2 <- sapply(df1[c(TRUE, FALSE)], function(x) x^2)
isTRUE(dim(df1)==dim(df2)) # FALSE
The problem with this code is that it drops all columns to which fun was not applied (dim(df2) # 5 1).
Assigning variables to slices
You can assign new values for subsets of an object. Say for:
x <- c(1,2,3)
x[2] <- 4
Now x will be c(1,4,3). Similarly, you can do this for rows/columns of a matrix or data frame. Here we use the apply function with the second argument 2 for columns (1 for rows). I recommend the seq function to generate a sequence of indices: from=1, by=2 gives odd and from=2, by=2 gives even indices. Specifying it this way generalises to other subsets and makes it straightforward to check that you got it right.
a <- c(1,2,3,4,5)
b <- c(6,7,8,9,10)
df1 <- data.frame(a,b)
df2 <- df1
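# drop = FALSE keeps a data frame even when only one column is selected, which apply needs
df2[,seq(1, ncol(df2), 2)] <- apply(df2[,seq(1, ncol(df2), 2), drop = FALSE], 2, function(x) x^2)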
Loops
Note that you can also do this with a loop:
df2 <- df1
for(col in seq(1, ncol(df2), 2)) df2[,col] <- sapply(df2[,col], function(x) x^2)
Vectorised functions
Since the squared operation is "vectorised" in R, in this case you could also do:
for(col in seq(1, ncol(df2), 2)) df2[,col] <- df2[,col]^2
Or use vectorisation completely:
df2 <- df1
df2[,seq(1, ncol(df2), 2)] <- df2[,seq(1, ncol(df2), 2)]^2
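For functions that are not vectorised over whole data frames, the same slice assignment can be combined with lapply. The following is a minimal sketch on a three-column example; the extra column and the name idx are made up for illustration.

```r
df1 <- data.frame(a = 1:5, b = 6:10, c = 11:15)

# indices of every second column, starting with the first
idx <- seq(1, ncol(df1), by = 2)

df2 <- df1
# lapply returns one transformed vector per selected column; assigning the
# list back to the same columns leaves the other columns untouched
df2[idx] <- lapply(df1[idx], function(x) x^2)
df2
```

Because only the selected columns are replaced, df2 keeps the dimensions and column names of df1.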

Making a column to help aggregation in r dataframe

I need to construct a new column for R dataframe that would help in aggregation.
First, I have some vectors:
vector1 <- c("ITEM11","ITEM12","ITEM13")
vector2 <- c("ITEM21","ITEM22","ITEM32")
and dataframe DF which has column VAR with the items included in the vectors. Now I want to make new column AGGVAR:
DF$AGGVAR[DF$VAR %in% vector1] <- "vector1"
This is manageable with a small number of vectors, but I want to make it neater for more vectors. I made a list of the vector names
vectorList <- ls(pattern = "^vector")
and my obviously naive attempt was
for(i in vectorList){DF$AGGVAR[DF$VAR %in% i] <- i}
What is still needed to make this work?
EDIT: My problem was actually bit more hairy than I first presented. The vectors don't actually have neat numerical suffixes, e.g.:
vectorGHI <- c("ITEM11","ITEM12","ITEM13")
vectorJKL <- c("ITEM21","ITEM22","ITEM32")
Something like this should do the trick:
vector1 <- c("ITEM11","ITEM12","ITEM13")
vector2 <- c("ITEM21","ITEM22","ITEM32")
d <- data.frame(var=c(vector1, vector2))
L <- mget(ls(patt='^vector'))
d$aggvar <- paste0('vector', sapply(d$var, grep, L))
d
# var aggvar
# 1 ITEM11 vector1
# 2 ITEM12 vector1
# 3 ITEM13 vector1
# 4 ITEM21 vector2
# 5 ITEM22 vector2
# 6 ITEM32 vector2
An alternative, which might have better performance:
lookup <- cbind(unlist(L),
rep(names(L), lengths(L))) # also works when the vectors differ in length
d$aggvar <- lookup[match(d$var, lookup[, 1]), 2]
Slightly modified answer based on jbaums' suggestion to make this complete:
namesVectors <- ls(pattern = "^vector")
vectorList <- mget(namesVectors)
# Getting rid of auxiliary prefix
namesVectors <- substring(namesVectors, 7)
DF$AGGVAR <- sapply(DF$VAR, grep, vectorList)
for(i in seq_along(namesVectors)) {DF$AGGVAR[DF$AGGVAR == i] <- namesVectors[i]}
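Another way to build the lookup, sketched here under the assumption that the vectors can be put into a named list by hand (or via mget as above), is base R's stack(), which returns one row per item with the list name in the ind column. The item "ITEM99" and the data frame contents are made up to show what happens to unmatched values.

```r
vectorGHI <- c("ITEM11", "ITEM12", "ITEM13")
vectorJKL <- c("ITEM21", "ITEM22", "ITEM32")

# named list: the names become the aggregation labels
groups <- list(GHI = vectorGHI, JKL = vectorJKL)

# long lookup table with columns 'values' (item) and 'ind' (group name)
lookup <- stack(groups)

DF <- data.frame(VAR = c("ITEM21", "ITEM11", "ITEM32", "ITEM99"),
                 stringsAsFactors = FALSE)

# match each item to its row in the lookup; unmatched items become NA
DF$AGGVAR <- as.character(lookup$ind[match(DF$VAR, lookup$values)])
DF
```

Using match() relies on exact equality rather than pattern matching, so an item such as "ITEM1" would not accidentally match "ITEM11".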

return identical DF or vector instead of NULL

users,
I have data.frames which are NULL in my results, but I don't want them to be NULL. I want them to be the same as the beginning (unchanged). I'm working on a list of files and the aim of my code is to fill all the NA with data from my other data.frames (according to the best correlation coefficient). Here's a small example:
Imagine these are my 3 input data frames (10 rows each):
ST1 <- data.frame(x1=c(1:10))
ST2 <- data.frame(x2=c(1:5,NA,NA,8:10))
ST3 <- data.frame(x3=c(NA,NA,NA,NA,NA,NA,NA,NA,NA,NA))
The aim here is for example, if there're NAs in ST1, ST1 must be filled with data from the best correlated file with ST1 (between ST2 and ST3 in this example)).
As ST3 has no data here, I cannot have any correlation coefficient. So NAs from ST3 cannot be filled, and ST3 cannot also be used to fill another file. So ST3 has no use if you want. Nevertheless I want to keep ST3 unchanged during all my code.
So the problem in my code comes from data.frames with no data and so with only NAs.
For the moment my code would give this for "refill" (end of my code) (filled NA in my data.frames):
ST1 <- data.frame(x1=c(1:10))
ST2 <- data.frame(x2=c(1:5,6,7,8:10))
ST3 <- NULL
But actually, I want for results in "refill" this:
ST1 <- data.frame(x1=c(1:10))
ST2 <- data.frame(x2=c(1:5,6,7,8:10))
ST3 <- data.frame(x3=c(NA,NA,NA,NA,NA,NA,NA,NA,NA,NA))
So for data.frames with only NAs, I don't want them to be NULL in "refill", but I want them to be identical as in input. I need this to have the same dimensions of data.frames between inputs and outputs.
If they are as NULL (like it is for the moment but I don't understand why and I want to change this), there will be 0 rows in this data.frame instead of 10 rows like the other data.frames.
So I think there's something wrong in my code in function "process.all" or "na.fill" or maybe "lst".
Here's my code; it is a reproducible example so that you can understand my error (you'll see in head(refill) ST2 is set as NULL).
Sorry if it is a bit long, but my error depends on other functions used previously. I hope you understand my problem and what I'm trying to do. Thanks for your help!
(For information, in function "process.all" and "na.fill": x is the data.frame I want to fill, and y is the file which will be used to fill x (so the best correlated file with x)).
Geoffrey
# my data for example
DF1 <- data.frame(x1=c(NA,NA,rnorm(3:20)),x2=c(31:50))
write.table(DF1,"ST001_2008.csv",sep=";")
DF2 <- data.frame(x1=c(NA,NA,NA,NA,NA,NA,NA,NA,NA,NA,rnorm(1:10)),x2=c(1:20))
write.table(DF2,"ST002_2008.csv",sep=";")
DF3 <- data.frame(x1=rnorm(81:100),x2=NA)
write.table(DF3,"ST003_2008.csv",sep=";")
DF4 <- data.frame(x1=c(21:40),x2=rnorm(1:20))
write.table(DF4,"ST004_2008.csv",sep=";")
# Correlation table
corhiver2008capt1 <- read.table(text=" ST001 ST002 ST003 ST004
ST001 1.0000000 NA -0.4350665 0.3393549
ST002 NA NA NA NA
ST003 -0.4350665 NA 1.0000000 -0.4992513
ST004 0.3393549 NA -0.4992513 1.0000000",header=T)
lst <- lapply(list.files(pattern="\\_2008.csv$"), read.table,sep=";", header=TRUE, stringsAsFactors=FALSE)
Stations <-c("ST001","ST002","ST003","ST004")
names(lst) <- Stations
# searching the highest correlation for each data.Frame
get.max.cor <- function(station, mat){
mat[row(mat) == col(mat)] <- -Inf
m <- max(mat[station, ],na.rm=TRUE)
if (is.finite(m)) {return(which( mat[station, ] == m ))}
else {return(NA)}
}
# fill the data.frame with the data.frame which has the highest correlation coefficient
na.fill <- function(x, y){
if(all(!is.finite(y[1:10,1]))) return(y)
i <- is.na(x[1:10,1])
xx <- y[1:10,1]
new <- data.frame(xx=xx)
x[1:10,1][i] <- predict(lm(x[1:10,1]~xx, na.action=na.exclude),new)[i]
x
}
process.all <- function(df.list, mat){
f <- function(station)
na.fill(df.list[[ station ]], df.list[[ max.cor[station] ]])
g <- function(station){
x <- df.list[[station]]
if(any(!is.finite(x[1:10,1]))){
mat[row(mat) == col(mat)] <- -Inf
nas <- which(is.na(x[1:10,1]))
ord <- order(mat[station, ], decreasing = TRUE)[-c(1, ncol(mat))]
for(y in ord){
if(all(!is.na(df.list[[y]][1:10,1][nas]))){
xx <- df.list[[y]][1:10,1]
new <- data.frame(xx=xx)
x[1:10,1][nas] <- predict(lm(x[1:10,1]~xx, na.action=na.exclude), new)[nas]
break
}
}
}
x
}
n <- length(df.list)
nms <- names(df.list)
max.cor <- sapply(seq.int(n), get.max.cor, corhiver2008capt1)
df.list <- lapply(seq.int(n), f)
df.list <- lapply(seq.int(n), g)
names(df.list) <- nms
df.list
}
refill <- process.all(lst, corhiver2008capt1)
refill <- as.data.frame(refill) ########## HERE IS THE PROBLEM ######
refill
How about
if(sum(!is.na(ST3)) == 0) {
# no usable data: skip whatever you normally would do and go to the next vector
}
This assumes, of course, that you don't have any problems with, say, a vector of 1999 NAs and one numerical value.
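One way to apply this guard inside a fill function is to return the input unchanged whenever either side has no usable data. The sketch below is only an illustration on the small ST1/ST2/ST3 example from the question; fill_from is a hypothetical helper, not the asker's na.fill, and it works on the first column of each data frame.

```r
fill_from <- function(x, y) {
  # nothing to fill from (y missing or all NA) or nothing to fit on (x all NA):
  # return x exactly as it came in instead of NULL or y
  if (is.null(y) || all(is.na(y[[1]])) || all(is.na(x[[1]]))) return(x)
  i  <- is.na(x[[1]])                 # positions that need filling
  xx <- y[[1]]                        # predictor taken from the donor frame
  fit <- lm(x[[1]] ~ xx, na.action = na.exclude)
  x[[1]][i] <- predict(fit, data.frame(xx = xx))[i]
  x
}

ST1 <- data.frame(x1 = 1:10)
ST2 <- data.frame(x2 = c(1:5, NA, NA, 8:10))
ST3 <- data.frame(x3 = rep(NA_real_, 10))

filled2 <- fill_from(ST2, ST1)   # NAs in ST2 predicted from ST1
kept3   <- fill_from(ST3, ST1)   # all-NA frame comes back unchanged
kept1   <- fill_from(ST1, NULL)  # no donor available: unchanged
```

With this kind of early return, every element of the resulting list keeps its original number of rows, which is what a later as.data.frame() call needs.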
