I would like to create a simple code to join columns then count how many category variable has in my data frame. My problems is when I use apply, I get a right result but five times.
EXAMPLE:
a <- c('car','bike',NA,'moto','skate')
b <- c(NA,'car',NA,NA,'bike')
c <- c('car',NA,NA,'skate',NA)
d <- c('moto','skate',NA,'car',NA)
data <- data.frame(a,b,c,d)
then, using apply:
x <- vector('list',length = NCOL(data)*NROW(data))
one_column <- apply(data,1,function(y){
x <- rbind(y,x)
return(x)
})
Then unlist and use table for count how many categorical variables I have in my data:
one_column <- unlist(one_column)
table(one_column)
But for I get the right result I need divide by 5:
table(one_column)/5
The x vector you created is 5 times larger than each column, so it's recycling the data. What you want is this instead.
x <- vector('list',length = NCOL(data))
Or, like emilliman5 says, just use table(unlist(data)).
Related
I am trying to subset this data frame by pre determined row numbers.
# Make dummy data frame
df <- data.frame(data=1:200)
train.length <- 1:2
# Set pre determined row numbers for subsetting
train.length.1 = 1:50
test.length.1 = 50:100
train.length.2 = 50:100
test.length.2 = 100:150
train.list <- list()
test.list <- list()
# Loop for subsetting by row, using row numbers in variables above
for (i in 1:length(train.length)) {
# subset by row number, each row number in variables train.length.1,2etc..
train.list[[i]] <- df[train.length.[i],] # need to place the variable train.length.n here...
test.list[[i]] <- df[test.length.[i],] # place test.length.n variable here..
# save outcome to lists
}
My question is, if I have my row numbers stored in a variable, how I do place each [ith] one inside the subsetting code?
I have tried:
df[train.length.[i],]
also
df[paste0"train.length.",[i],]
however that pastes as a character and it doesnt read my train.length.n variable... as below
> train.list[[i]] <- df[c(paste0("train.length.",train.length[i])),]
> train.list
[[1]]
data data1
NA NA NA
If i have the variable in there by itself, it works as intended. Just need it to work in a for loop
Desired output - print those below
train.set.output.1 <- df[train.length.1,]
test.set.output.1 <- df[test.length.1,]
train.set.output.2 <- df[train.length.2,]
test.set.output.2 <- df[test.length.2,]
I can do this manually, but its cumersome for lots of train / test sets... hence for loop
Consider staggered seq() and pass the number sequences in lapply to slice by rows. Also, for equal-length dataframes, you likely intended starts at 1, 51, 101, ...
train_num_set <- seq(1, 200, by=50)
train.list <- lapply(train_num_set, function(i) df[c(i:(i+49)),])
test_num_set <- seq(51, 200, by=50)
test.list <- lapply(test_num_set, function(i) df[c(i:(i+49)),])
Create a function that splits your data frame into different chunks:
split_frame_by_chunks <- function(data_frame, chunk_size) {
n <- nrow(data_frame)
r <- rep(1:ceiling(n/chunk_size),each=chunk_size)[1:n]
sub_frames <- split(data_frame,r)
return(sub_frames)
}
Call your function using your data frame and chunk size. In your case, you are splitting your data frame into chunks of 50:
chunked_frames <- split_frame_by_chunks(data_frame, 50)
Decide number of train/test splits to create in the loop
num_splits <- 2
Create the appropriate train and test sets inside your loop. In this case, I am creating the 2 you showed in your question. (i.e. the first loop creates a train and test set with rows 1-50 and 50-100 respectively):
for(i in 1:num_splits) {
this_train <- chunked_frames[i]
this_test <- chunked_frames[i+1]
}
Just do whatever you need to the dynamically created train and test frames inside your loop.
My data frame contains 22 columns: "DATE", "INDEX" and S1, S2, S3 ... S20. There are over 4322 rows. I want to calculate log returns and store the results in a data frame. That should give me 4321 rows.
I run this code, but I am sure there is a much more elegant way to do the calculation in a short way.
# count the sum of rows in order to make the following formula work appropriately - (n-1)
n <- nrow(df)
# calculating the log returns (natural logarithm), of INDEX and S1-20
LogRet_INDEX <- log(df$INDEX[2:n])-log(df$INDEX[1:(n-1)])
LogRet_S1 <- log(df$S1[2:n])-log(df$S1[1:(n-1)])
LogRet_S2 <- log(df$S2[2:n])-log(df$S2[1:(n-1)])
LogRet_S3 <- log(df$S3[2:n])-log(df$S3[1:(n-1)])
LogRet_S4 <- log(df$S4[2:n])-log(df$S4[1:(n-1)])
LogRet_S5 <- log(df$S5[2:n])-log(df$S5[1:(n-1)])
LogRet_S6 <- log(df$S6[2:n])-log(df$S6[1:(n-1)])
LogRet_S7 <- log(df$S7[2:n])-log(df$S7[1:(n-1)])
LogRet_S8 <- log(df$S8[2:n])-log(df$S7[1:(n-1)])
LogRet_S9 <- log(df$S9[2:n])-log(df$S8[1:(n-1)])
LogRet_S10 <- log(df$S10[2:n])-log(df$S10[1:(n-1)])
LogRet_S11 <- log(df$S11[2:n])-log(df$S11[1:(n-1)])
LogRet_S12 <- log(df$S12[2:n])-log(df$S12[1:(n-1)])
LogRet_S13 <- log(df$S13[2:n])-log(df$S13[1:(n-1)])
LogRet_S14 <- log(df$S14[2:n])-log(df$S14[1:(n-1)])
LogRet_S15 <- log(df$S15[2:n])-log(df$S15[1:(n-1)])
LogRet_S16 <- log(df$S16[2:n])-log(df$S16[1:(n-1)])
LogRet_S17 <- log(df$S17[2:n])-log(df$S17[1:(n-1)])
LogRet_S18 <- log(df$S18[2:n])-log(df$S18[1:(n-1)])
LogRet_S19 <- log(df$S19[2:n])-log(df$S19[1:(n-1)])
LogRet_S20 <- log(df$S20[2:n])-log(df$S20[1:(n-1)])
# adding the results from the previous calculation (log returns) to a data frame
LogRet_df <- data.frame(LogRet_INDEX, LogRet_S1, LogRet_S2, LogRet_S3, LogRet_S4, LogRet_S5, LogRet_S6, LogRet_S7, LogRet_S8, LogRet_S9, LogRet_S10, LogRet_S11, LogRet_S12, LogRet_S13, LogRet_S14, LogRet_S15, LogRet_S16, LogRet_S17, LogRet_S18, LogRet_S19, LogRet_S20)
Is there a possibility to make this code shorter? Maybe some kind of loop or using a for argument? Since I am quite new to R, I try to improve my knowledge.
Any kind of help is highly appreciated!
You can use sapply to apply a function to each column of the data.frame.
What the code below does, is 1) take columns 2 to 22 from the data frame called df. 2) for each of this columns, calculate logarithm of the respective column and then calculate the difference between two neighboring rows. 3) when done, convert it to data.frame called df2
df2 <- as.data.frame(sapply(df[2:22], function(x) diff(log(x))))
I have a large list, and would like to apply the exact technique detailed in the answer here:
Create mutually exclusive dummy variables from categorical variable in R
However, my data is much larger, and I would like to split, apply and combine the operation to each individual row.
This code, which of course does not work, illustrates what I am trying to do:
id <- c(1,1,1,1)
time <- c(1,2,3,4)
time <- as.character(time)
unique.time <- as.character(unique(df$time))
df <- data.frame(id,time)
df1 <- split(df, row(df))
sapply(df1, (unique.time, function(x)as.numeric(df1$time == x)))
z <- unsplit(lapply(df1, row(df)), scale), x)
Thanks!
I'm trying to clean this code up and was wondering if anybody has any suggestions on how to run this in R without a loop. I have a dataset called data with 100 variables and 200,000 observations. What I want to do is essentially expand the dataset by multiplying each observation by a specific scalar and then combine the data together. In the end, I need a data set with 800,000 observations (I have four categories to create) and 101 variables. Here's a loop that I wrote that does this, but it is very inefficient and I'd like something quicker and more efficient.
datanew <- c()
for (i in 1:51){
for (k in 1:6){
for (m in 1:4){
sub <- subset(data,data$var1==i & data$var2==k)
sub[,4:(ncol(sub)-1)] <- filingstat0711[i,k,m]*sub[,4:(ncol(sub)-1)]
sub$newvar <- m
datanew <- rbind(datanew,sub)
}
}
}
Please let me know what you think and thanks for the help.
Below is some sample data with 2K observations instead of 200K
# SAMPLE DATA
#------------------------------------------------#
mydf <- as.data.frame(matrix(rnorm(100 * 20e2), ncol=20e2, nrow=100))
var1 <- c(sapply(seq(41), function(x) sample(1:51)))[1:20e2]
var2 <- c(sapply(seq(2 + 20e2/6), function(x) sample(1:6)))[1:20e2]
#----------------------------------#
mydf <- cbind(var1, var2, round(mydf[3:100]*2.5, 2))
filingstat0711 <- array(round(rnorm(51*6*4)*1.5 + abs(rnorm(2)*10)), dim=c(51,6,4))
#------------------------------------------------#
You can try the following. Notice that we replaced the first two for loops with a call to mapply and the third for loop with a call to lapply.
Also, we are creating two vectors that we will combine for vectorized multiplication.
# create a table of the i-k index combinations using `expand.grid`
ixk <- expand.grid(i=1:51, k=1:6)
# Take a look at what expand.grid does
head(ixk, 60)
# create two vectors for multiplying against our dataframe subset
multpVec <- c(rep(c(0, 1), times=c(4, ncol(mydf)-4-1)), 0)
invVec <- !multpVec
# example of how we will use the vectors
(multpVec * filingstat0711[1, 2, 1] + invVec)
# Instead of for loops, we can use mapply.
newdf <-
mapply(function(i, k)
# The function that you are `mapply`ing is:
# rbingd'ing a list of dataframes, which were subsetted by matching var1 & var2
# and then multiplying by a value in filingstat
do.call(rbind,
# iterating over m
lapply(1:4, function(m)
# the cbind is for adding the newvar=m, at the end of the subtable
cbind(
# we transpose twice: first the subset to multiply our vector.
# Then the result, to get back our orignal form
t( t(subset(mydf, var1==i & mydf$var2==k)) *
(multpVec * filingstat0711[i,k,m] + invVec)),
# this is an argument to cbind
"newvar"=m)
)),
# the two lists you are passing as arguments are the columns of the expanded grid
ixk$i, ixk$k, SIMPLIFY=FALSE
)
# flatten the data frame
newdf <- do.call(rbind, newdf)
Two points to note:
Try not to use words like data, table, df, sub etc which are commonly used functions
In the above code I used mydf in place of data.
You can use apply(ixk, 1, fu..) instead of the mapply that I used, but I think mapply makes for cleaner code in this situation
I am having a list of identically sorted dataframes. More specific these are the imputed dataframes which I get after doing Multiple imputations with the AmeliaII package. Now I want to create a new dataframe that is identical in structure, but contains the mean values of the cells calculated across the dataframes.
The way I achieve this at the moment is the following:
## do the Amelia run ------------------------------------------------------------
a.out <- amelia(merged, m=5, ts="Year", cs ="GEO",polytime=1)
## Calculate the output statistics ----------------------------------------------
left.side <- a.out$imputations[[1]][,1:2]
a.out.ncol <- ncol(a.out$imputations[[1]])
a <- a.out$imputations[[1]][,3:a.out.ncol]
b <- a.out$imputations[[2]][,3:a.out.ncol]
c <- a.out$imputations[[3]][,3:a.out.ncol]
d <- a.out$imputations[[4]][,3:a.out.ncol]
e <- a.out$imputations[[5]][,3:a.out.ncol]
# Calculate the Mean of the matrices
mean.right <- apply(abind(a,b,c,d,e,f,g,h,i,j,along=3),c(1,2),mean)
# recombine factors with values
mean <- cbind(left.side,mean.right)
I suppose that there is a much better way of doing this by using apply, plyr or the like, but as a R Newbie I am really a bit lost here. Do you have any suggestions how to go about this?
Here's an alternate approach using Reduce and plyr::llply
dfr1 <- data.frame(a = c(1,2.5,3), b = c(9.0,9,9), c = letters[1:3])
dfr2 <- data.frame(a = c(5,2,5), b = c(6,5,4), c = letters[1:3])
tst = list(dfr1, dfr2)
require(plyr)
tst2 = llply(tst, function(df) df[,sapply(df, is.numeric)]) # strip out non-numeric cols
ans = Reduce("+", tst2)/length(tst2)
EDIT. You can simplify your code considerably and accomplish what you want in 5 lines of R code. Here is an example using the Amelia package.
library(Amelia)
data(africa)
# carry out imputations
a.out = amelia(x = africa, cs = "country", ts = "year", logs = "gdp_pc")
# extract numeric columns from each element of a.out$impuations
tst2 = llply(a.out$imputations, function(df) df[,sapply(df, is.numeric)])
# sum them up and divide by length to get mean
mean.right = Reduce("+", tst2)/length(tst2)
# compute fixed columns and cbind with mean.right
left.side = a.out$imputations[[1]][1:2]
mean0 = cbind(left.side,mean.right)
If I understand your question correctly, then this should get you a long way:
#set up some data:
dfr1<-data.frame(a=c(1,2.5,3), b=c(9.0,9,9))
dfr2<-data.frame(a=c(5,2,5), b=c(6,5,4))
tst<-list(dfr1, dfr2)
#since all variables are numerical, use a threedimensional array
tst2<-array(do.call(c, lapply(tst, unlist)), dim=c(nrow(tst[[1]]), ncol(tst[[1]]), length(tst)))
#To see where you're at:
tst2
#rowMeans for a threedimensional array and dims=2 does the mean over the last dimension
result<-data.frame(rowMeans(tst2, dims=2))
rownames(result)<-rownames(tst[[1]])
colnames(result)<-colnames(tst[[1]])
#display the full result
result
HTH.
After many attempts, I've found a reasonably fast way to calculate cells' means across multiple data frames.
# First create an empty data frame for storing the average imputed values. This
# data frame will have the same dimensions of the original one
imp.df <- df
# Then create an array with the first two dimensions of the original data frame and
# the third dimension given by the number of imputations
a <- array(NA, dim=c(nrow(imp.df), ncol(imp.df), length(a.out$imputations)))
# Then copy each imputation in each "slice" of the array
for (z in 1:length(a.out$imputations)) {
a[,,z] <- as.matrix(a.out$imputations[[z]])
}
# Finally, for each cell, replace the actual value with the mean across all
# "slices" in the array
for (i in 1:dim(a)[1]) {
for (j in 1:dim(a)[2]) {
imp.df[i, j] <- mean(as.numeric(a[i, j,]))
}}