R creating multiple 2 by 2 tables from a data frame

Next question - I have created the following data frame in R
x <- as.integer(rnorm(n=1000, mean=10, sd=5))
y <- 1:1000
z <- sample (c(0,1),1000, replace=T)
df <- data.frame(x,y,z)
# create variables df using x
for(i in 1:10){
df[paste0("col",i)] <- ifelse(df$x <i, 1, 0)
}
# create 2 by 2 tables of z against col1 to col 10
for(i in 1:10){
table[i] <- table (df[paste0("col",i)], df$z)
}
I already received some excellent help to create variables in R using a for loop within a data frame.
However, I am now struggling to use a similar for loop to create the two by two tables (last section of the code).
Can anybody tell me where I am going wrong?
Thanks again as always!

There are several problems with the code you have written.
First of all, the table data-object does not exist, so you cannot index-assign to it.
Secondly, you need to use "[[" when accessing a named item (otherwise you get a sublist).
Finally, if you make a list, which is really the most sensible type of storage for a series of table-objects, you need to use "[[" rather than "[" to extract an item (rather than a sublist).
I also took the liberty of renaming it to tbl so there would not be cognitive confusion about what was function and what was data.
tbl<- list();
for(i in 1:10){
tbl[[i]] <- table (df[[paste0("col",i)]], df$z)
}
tbl[[1]]

      0   1
  0 488 473
  1  16  23
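To make the difference between "[" and "[[" concrete, here is a minimal sketch. It uses a smaller, seeded version of the data frame from the question (the seed, the sample size of 100 and the use of lapply are my assumptions, not part of the original answer).

```r
set.seed(42)  # assumed seed, only so the sketch is reproducible

# Smaller stand-in for the data frame in the question
df <- data.frame(
  x = as.integer(rnorm(n = 100, mean = 10, sd = 5)),
  z = sample(c(0, 1), 100, replace = TRUE)
)
for (i in 1:3) {
  df[paste0("col", i)] <- ifelse(df$x < i, 1, 0)
}

# "[" keeps the data frame wrapper, "[[" extracts the column itself
class(df["col1"])    # "data.frame"
class(df[["col1"]])  # "numeric"

# Same idea as the loop above, written with lapply; the result is a named list
tbl <- lapply(paste0("col", 1:3), function(nm) table(df[[nm]], df$z))
names(tbl) <- paste0("col", 1:3)

tbl[["col1"]]  # a single table
tbl["col1"]    # a list of length one that contains the table
```

Naming the list elements is optional, but it lets you pull out a table by column name instead of by position.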

Related

Calculating log returns over columns of a data frame + store the results in a new data frame

My data frame contains 22 columns: "DATE", "INDEX" and S1, S2, S3 ... S20. There are over 4322 rows. I want to calculate log returns and store the results in a data frame. That should give me 4321 rows.
I run this code, but I am sure there is a much more elegant way to do the calculation in a short way.
# count the sum of rows in order to make the following formula work appropriately - (n-1)
n <- nrow(df)
# calculating the log returns (natural logarithm), of INDEX and S1-20
LogRet_INDEX <- log(df$INDEX[2:n])-log(df$INDEX[1:(n-1)])
LogRet_S1 <- log(df$S1[2:n])-log(df$S1[1:(n-1)])
LogRet_S2 <- log(df$S2[2:n])-log(df$S2[1:(n-1)])
LogRet_S3 <- log(df$S3[2:n])-log(df$S3[1:(n-1)])
LogRet_S4 <- log(df$S4[2:n])-log(df$S4[1:(n-1)])
LogRet_S5 <- log(df$S5[2:n])-log(df$S5[1:(n-1)])
LogRet_S6 <- log(df$S6[2:n])-log(df$S6[1:(n-1)])
LogRet_S7 <- log(df$S7[2:n])-log(df$S7[1:(n-1)])
LogRet_S8 <- log(df$S8[2:n])-log(df$S8[1:(n-1)])
LogRet_S9 <- log(df$S9[2:n])-log(df$S9[1:(n-1)])
LogRet_S10 <- log(df$S10[2:n])-log(df$S10[1:(n-1)])
LogRet_S11 <- log(df$S11[2:n])-log(df$S11[1:(n-1)])
LogRet_S12 <- log(df$S12[2:n])-log(df$S12[1:(n-1)])
LogRet_S13 <- log(df$S13[2:n])-log(df$S13[1:(n-1)])
LogRet_S14 <- log(df$S14[2:n])-log(df$S14[1:(n-1)])
LogRet_S15 <- log(df$S15[2:n])-log(df$S15[1:(n-1)])
LogRet_S16 <- log(df$S16[2:n])-log(df$S16[1:(n-1)])
LogRet_S17 <- log(df$S17[2:n])-log(df$S17[1:(n-1)])
LogRet_S18 <- log(df$S18[2:n])-log(df$S18[1:(n-1)])
LogRet_S19 <- log(df$S19[2:n])-log(df$S19[1:(n-1)])
LogRet_S20 <- log(df$S20[2:n])-log(df$S20[1:(n-1)])
# adding the results from the previous calculation (log returns) to a data frame
LogRet_df <- data.frame(LogRet_INDEX, LogRet_S1, LogRet_S2, LogRet_S3, LogRet_S4, LogRet_S5, LogRet_S6, LogRet_S7, LogRet_S8, LogRet_S9, LogRet_S10, LogRet_S11, LogRet_S12, LogRet_S13, LogRet_S14, LogRet_S15, LogRet_S16, LogRet_S17, LogRet_S18, LogRet_S19, LogRet_S20)
Is there a possibility to make this code shorter? Maybe some kind of loop or using a for argument? Since I am quite new to R, I try to improve my knowledge.
Any kind of help is highly appreciated!
You can use sapply to apply a function to each column of the data.frame.
The code below 1) takes columns 2 to 22 of the data frame called df, 2) for each of these columns, takes the logarithm and then the difference between neighboring rows, and 3) converts the result to a data frame called df2.
df2 <- as.data.frame(sapply(df[2:22], function(x) diff(log(x))))
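Here is a small check of that one-liner on made-up data. The data frame prices and its values are hypothetical; the columns are selected by name instead of by position, which is a variation on the answer rather than something it requires.

```r
# Hypothetical toy data: one date column followed by two price columns
prices <- data.frame(
  DATE  = as.Date("2020-01-01") + 0:2,
  INDEX = c(100, 110, 121),
  S1    = c(50, 55, 60.5)
)

# Pick every column except DATE, then apply diff(log(x)) column by column
price_cols <- setdiff(names(prices), "DATE")
log_ret <- as.data.frame(sapply(prices[price_cols], function(x) diff(log(x))))

# One row fewer than the input, because each return needs two prices
nrow(log_ret)
log_ret
```

Each price in the toy data grows by 10% per step, so every entry of log_ret should equal log(1.1).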

Adding Zero to a column in first x rows in R

I am creating a classification model for forecasting purposes. I have several ext files which I converted into one large list containing several lists (called comb). I then broke the large list into a separate dataframe with each list as its own column (called BI). Because each list may contain different number of elements, the simpler argument matrix(unlist(l), ncol=ncol) does not work. When reviewing alternatives, I made modification to compile the following:
max_length <- max(sapply(comb,length))
BI<-sapply(comb, function(x){
c(x, rep(0, max_length - length(x)))
})
This creates a dataframe assigning each list a column and assigning each missing element within that column a value of ZERO. Those zeros show at the end of that column but I would like them to be at the beginning of the column. Here is an example of current output:
cola colb colc
2 2 2
1 1 0
4 0 0
I need your help in converting my original code to produce the following format:
cola colb colc
2 0 0
1 2 0
4 1 2
It might be sufficient to interchange the order in the concatenation c:
max_length <- max(sapply(comb, length))
BI <- sapply(comb, function(x){
c(rep(0, max_length - length(x)), x)
})
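As a quick check, here is that version run on a small list shaped like the example in the question. The contents of comb are my guess at what the real list looks like, and lengths() is used as a shorthand for sapply(comb, length).

```r
# Assumed stand-in for the real 'comb': a named list of unequal-length vectors
comb <- list(cola = c(2, 1, 4), colb = c(2, 1), colc = 2)

max_length <- max(lengths(comb))

# Put the padding zeros in front of each vector instead of behind it
BI <- sapply(comb, function(x) c(rep(0, max_length - length(x)), x))
BI
```

Note that sapply returns a matrix here; wrap the result in as.data.frame(BI) if a data frame is needed.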
EDIT: Based on additional information in the comments below, here's an approach that modifies the code in another way. The idea is that as long as your first approach gives you a proper data frame, we can circumvent the problem by using the order-function.
max_length <- max(sapply(comb,length))
BI <- sapply(comb, function(x){
.zeros <- rep(0, max_length - length(x))
.rearange <- order(c(1:length(x), .zeros))
c(x, .zeros)[.rearange]
})
I have tested that this code works on a minor test example I created, but I'm not certain that this example resembles your comb...
If this revised approach doesn't work, then it's still possible to first create the data frame with your original code, and then reorder one column at a time.

Naming dataframes based on counter iteration in R?

I have a loop that will spit out a bunch of dataframes, and want to name the dataframes based on current iteration of the loop, e.g. df1 for the first iteration, df2 for the second iteration, and so on.
However, I'm running into problems trying to use the loop iteration counter to construct the dataframe name. For example, let's imagine I am in the first iteration of the loop and want to name the dataframe:
counter <- 1
as.name(paste("df",counter,sep="")) <- data.frame(x = (1:10), y = (10:1))
I get an error
Error in as.name(paste("df", counter, sep = "")) <- data.frame(x = (1:10), :
target of assignment expands to non-language object
Does anyone know how I might use the counter information to create dataframe names?
This is meant to complement Richard's, as it felt a little too substantial to simply edit into his.
A typical code pattern for this sort of thing would be:
#Initialize an empty list of the desired length
dfs <- vector("list",3)
#Fill the list with data frames, naming as we go
for (i in seq_along(dfs)){
dfs[[i]] <- data.frame(x = runif(5),y = runif(5))
names(dfs)[[i]] <- paste0("df",i)
}
This avoids assign, whose use is typically frowned upon as poor style. If the naming of the data frames is very regular, you don't even need to do it in the loop; you can name them all at once in a vectorized fashion:
names(dfs) <- paste0("df",seq_along(dfs))
And as I mentioned below Richard's answer, even though having them all in a list is never worse, and usually better, than having them as separate objects, you can convert the list to separate objects via:
list2env(dfs,envir = .GlobalEnv)
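Once the data frames live in a named list, the loop counter can be used to look them up directly, which is usually what the dynamic names were for in the first place. A minimal sketch (the lapply call and the variable names second and third are illustrative assumptions):

```r
# Build three data frames in one go and name them df1, df2, df3
dfs <- lapply(1:3, function(i) data.frame(x = runif(5), y = runif(5)))
names(dfs) <- paste0("df", seq_along(dfs))

# Look one up by a fixed name ...
second <- dfs[["df2"]]

# ... or by a name constructed from a counter
counter <- 3
third <- dfs[[paste0("df", counter)]]

names(dfs)
```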
Instead of cluttering the global environment with data frames, it would be best to collect them in a list, and then you can use paste0 to name them in setNames with e.g.
> dfList <- setNames(list(data.frame(x = 1:10, y = 10:1)), paste0("df", 1))
after that you can refer to the data frame with
> dfList$df1
    x  y
1   1 10
2   2  9
3   3  8
4   4  7
5   5  6
6   6  5
7   7  4
8   8  3
9   9  2
10 10  1
As joran notes, if you insist on populating the global environment with these data frames, you can use
list2env(dfList, envir = .GlobalEnv)
and the data frames will be assigned as objects in the global environment.
Use assign:
assign(paste0("df", counter), data.frame(x = (1:10), y = (10:1)))
I think you are looking for
assign("name", dataframe)

How to enter in R the results of each for loop run in a new column in a new matrix

I'm new to R and I'm trying to solve a problem by piecing scripts together and have failed to come up with a solution for the following problem.
What I want to do is to run the for-loop in the following script 100 times and have each results side by side in new columns in a data frame/table/matrix, ideally keeping the 1 column constant and merge all new runs to it.
Here is a script that I'm running:
# Test data.
data <- data.frame(a=2, b=3, c=4)
# Sampling from first row of data.
row <- 1
N_samples <- 50
samples <- sample(1:ncol(data), N_samples, rep=TRUE, prob=data[row,])
site_sample=data[row,]
# Count the number of each entry and store in a list.
for (i in 1:ncol(data)){
site_sample[[i]] <- sum(samples==i)
}
# Unlist the data to get an array that represents the bootstrap row.
site_sample <- unlist(site_sample)
write.table(site_sample, file="test1.csv", sep = ",")
Thanks for your help.
I'm trying to get something that looks like this:
a 9 15 21 ...(100 columns)
b 15 16 19 ...
c 26 19 10 ...
I'm not sure if this is exactly what you're looking for (I'm pretty sure the values I used for sampling aren't what you want), but the following code segment is able to put the data in the format you're looking for.
data <- data.frame(a=2, b=3, c=4) # Test data
row <- 1
N_samples <- 50
newDF <- data.frame(names(data), row.names = 1) # Creates new, empty dataframe
for (i in 1:ncol(data)) {
# Not sure if the next line is what you want, but it works as a good demo
samples <- sample(1:ncol(data), N_samples, rep=TRUE, prob=data[row,])
# Adds the count of the values generated above as the next col in the dataframe
newDF <- cbind(newDF, as.vector(table(samples)))
}
colnames(newDF) <- rep(" ", ncol(newDF)) # Removes dataframe's column names
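If the goal is 100 independent resampling runs side by side, one possible sketch uses replicate and tabulate. The names n_runs, probs and counts and the seed are made up for this illustration. tabulate with nbins keeps a zero count for a category that is never drawn, which as.vector(table(samples)) would silently drop.

```r
set.seed(1)  # assumed seed, only for reproducibility

data <- data.frame(a = 2, b = 3, c = 4)  # test data from the question
row <- 1
N_samples <- 50
n_runs <- 100                            # number of repeated runs (assumed)

# Sampling weights as a plain named numeric vector
probs <- unlist(data[row, ])

# Each run draws N_samples category indices and counts them;
# replicate() binds the runs together as the columns of a matrix
counts <- replicate(n_runs, {
  samples <- sample(seq_along(probs), N_samples, replace = TRUE, prob = probs)
  tabulate(samples, nbins = length(probs))
})
rownames(counts) <- names(data)

dim(counts)       # 3 rows (a, b, c) by 100 columns (runs)
counts[, 1:5]     # first five runs
```

The result is a matrix with one row per category and one column per run, so every column sums to N_samples. It can be converted with as.data.frame(counts) if a data frame is preferred.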

Populating a data frame in R in a loop

I am trying to populate a data frame from within a for loop in R. The names of the columns are generated dynamically within the loop and the value of some of the loop variables is used as the values while populating the data frame. For instance the name of the current column could be some variable name as a string in the loop, and the column can take the value of the current iterator as its value in the data frame.
I tried to create an empty data frame outside the loop, like this
d = data.frame()
But I can't really do anything with it; the moment I try to populate it, I run into an error
d[1] = c(1,2)
Error in `[<-.data.frame`(`*tmp*`, 1, value = c(1, 2)) :
replacement has 2 rows, data has 0
What would be a good way to achieve what I am looking to do? Please let me know if I wasn't clear.
It is often preferable to avoid loops and use vectorized functions. If that is not possible, there are two approaches:
1. Preallocate your data.frame. This is not recommended because indexing is slow for data.frames.
2. Use another data structure in the loop and transform it into a data.frame afterwards. A list is very useful here.
Example to illustrate the general approach:
mylist <- list() #create an empty list
for (i in 1:5) {
vec <- numeric(5) #preallocate a numeric vector
for (j in 1:5) { #fill the vector
vec[j] <- i^j
}
mylist[[i]] <- vec #put all vectors in the list
}
df <- do.call("rbind",mylist) #combine all vectors into a matrix
In this example it is not necessary to use a list, you could preallocate a matrix. However, if you do not know how many iterations your loop will need, you should use a list.
Finally here is a vectorized alternative to the example loop:
outer(1:5,1:5,function(i,j) i^j)
As you see it's simpler and also more efficient.
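For the case described in the question, where the column names are built inside the loop, the same list approach might look like the sketch below. The names cols and col_name and the "var" prefix are invented for this illustration.

```r
cols <- list()  # collect columns here instead of growing a data frame

for (i in 1:3) {
  col_name <- paste0("var", i)     # column name generated inside the loop
  cols[[col_name]] <- c(i, i * 2)  # column values derived from the iterator
}

# Convert once, after the loop; all columns must have the same length
d <- as.data.frame(cols)
d
```

Because the conversion happens only once, the empty data frame and the "replacement has 2 rows, data has 0" error never come into play.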
You could do it like this:
iterations = 10
variables = 2
output <- matrix(ncol=variables, nrow=iterations)
for(i in 1:iterations){
output[i,] <- runif(2)
}
output
and then turn it into a data.frame
output <- data.frame(output)
class(output)
What this does:
1. Create a matrix with rows and columns according to the expected growth.
2. Insert 2 random numbers into each row of the matrix.
3. Convert it into a data frame after the loop has finished.
This works too.
df = NULL
for (k in 1:10)
{
x = 1
y = 2
z = 3
df = rbind(df, data.frame(x,y,z))
}
The output will look like this (the loop appends ten identical rows; only the first is shown)
df #enter
x y z #col names
1 2 3
Thanks Notable1, works for me with the tidytextr
Create a dataframe with the name of files in one column and content in other.
diretorio <- "D:/base"
arquivos <- list.files(diretorio, pattern = "\\.PDF$") # pattern is a regular expression, not a glob
quantidade <- length(arquivos)
#
df = NULL
for (k in 1:quantidade) {
nome = arquivos[k]
print(nome)
Sys.sleep(1)
dados = read_pdf(arquivos[k],ocr = T)
print(dados)
Sys.sleep(1)
df = rbind(df, data.frame(nome,dados))
Sys.sleep(1)
}
Encoding(df$text) <- "UTF-8"
I had a case where I needed to use a data frame within a for loop. In this case it was the "efficient" approach; keep in mind, however, that the data set was small and the iterations in the loop were very simple. The code may still be useful for someone with similar conditions.
The purpose of the for loop was to use the raster extract function across five locations (Tokyo, New York, São Paulo, Seoul and Mexico City), each of which had its own raster grids. I had a spatial point database with more than 1000 observations spread over the 5 locations, and I needed to extract information from 10 different raster grids (two grids per location). For the subsequent analysis I needed not only the raster values but also the unique ID of each observation.
Preparing the spatial data included the following tasks:
Import points shapefile with the readOGR function (rgdal package)
Import raster files with the raster function (raster package)
Stack grids from the same location into one file, with the function stack (raster package)
Here is the for loop code, which uses a data frame:
1. Add stacked rasters per location into a list
raslist <- list(LOC1,LOC2,LOC3,LOC4,LOC5)
2. Create an empty dataframe, this will be the output file
TB <- data.frame(VAR1=double(),VAR2=double(),ID=character())
3. Set up for loop function
L1 <- seq(1,5,1) # the location ID is a numeric variable with values from 1 to 5
for (i in 1:length(L1)) {
dat=subset(points,LOCATION==i) # select corresponding points for location [i]
t=data.frame(extract(raslist[[i]],dat),dat$ID) # run extract function with points & raster stack for location [i]
names(t)=c("VAR1","VAR2","ID")
TB=rbind(TB,t)
}
I was looking for the same thing, and the following may be useful as well.
a <- vector("list", 1)
for(i in 1:3){a[[i]] <- data.frame(x= rnorm(2), y= runif(2))}
a
rbind(a[[1]], a[[2]], a[[3]])
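The last line spells out every list element by hand. When the number of elements is not known in advance, do.call can pass the whole list to rbind in one step. A minimal sketch of that variation (the seed and the name combined are assumptions):

```r
set.seed(1)  # assumed seed, only for reproducibility

# Preallocate the list with its final length, then fill it in the loop
a <- vector("list", 3)
for (i in seq_along(a)) {
  a[[i]] <- data.frame(x = rnorm(2), y = runif(2))
}

# Equivalent to rbind(a[[1]], a[[2]], a[[3]]), but works for any length
combined <- do.call(rbind, a)
combined
```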
