I have a one-column table of postcodes. I would like to loop through each postcode using the postcode_lookup() function from the PostcodesioR package.
My current attempts are the following:
x <- data.frame()
for(i in 1:3){
x[i, ] <- postcode_lookup(table$Var1[i])
}
So I instantiated a new table and tried to add the result of postcode_lookup() as a new row on each iteration. But I get nothing: the result is a data frame with 3 obs. and 0 variables. The data should have 31 columns and multiple rows.
You need to explicitly specify the number of columns when creating a data frame:
df <- as.data.frame(matrix(NA, 0, 1))
set.seed(123)
val <- runif(20)
for (i in 1:3){
df[i, ] <- val[[i]]
}
In this case, a matrix with 0 rows and 1 column is converted to a data frame. This is a convenient way to create an empty data frame with the required number of columns.
In your case, you have a data frame with 0 columns. Hence, nothing gets populated.
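An alternative that avoids fixing the column count up front is to collect each result in a list and bind the rows once at the end. The following is a minimal sketch, not the asker's actual data: `fake_lookup()` and `postcodes` are hypothetical stand-ins, and it assumes the real lookup returns a one-row data frame with the same columns on every call.

```r
# Hypothetical stand-in for postcode_lookup(): returns a one-row data frame.
fake_lookup <- function(postcode) {
  data.frame(postcode = postcode,
             n_chars = nchar(postcode),
             stringsAsFactors = FALSE)
}

# Hypothetical input, standing in for table$Var1
postcodes <- c("EC1Y 8LX", "SW1A 1AA", "M1 1AE")

# Collect one data frame per postcode, then bind them row-wise in one step
results <- lapply(postcodes, fake_lookup)
x <- do.call(rbind, results)
```

Because the columns come from the first result, the number of columns never has to be declared in advance.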
Is there a way for me to iteratively build a dataframe in R? I would be interested in knowing how I would do so either by adding column-by-column or row-by-row. I have been trying for some time now and find myself stuck.
Here is some code that I have tried:
line <- as.list(strsplit(line, ", "))[[1]] # make into list
col_names = names(idx_for_cell_counts_by_gene_id)
df <- data.frame() # here is where I get stuck - want an empty dataframe
for (x in 1:length(col_names)) {
column_name <- col_names[[x]]
information <- line[[x]]
df$column_name <- information
}
I have tried looking at some SO examples (#1, #2) but to no avail. Is there something I should do to instantiate an empty dataframe (or, better yet, a dataframe with only 'column headers' and no rows) in R?
One issue is that df$column_name creates a column named column_name. It doesn't use the value in the object named column_name. Making a representative example and walking through it will show you:
df <- data.frame(placeholder = 0)
column_name <- "my_col"
# The following will create a column named "column_name"
df$column_name <- 0
# df
# placeholder column_name
# 1 0 0
# The following will create a column with the value inside of the object `column_name`
df[,column_name] <- 0
# df
# placeholder column_name my_col
# 1 0 0 0
Another issue you have is that you're making a data.frame with 0 rows. That means that any column you add needs to be a matching length. All columns in a dataframe must be the same length.
One way to deal with this is to create a placeholder column when you create the dataframe and then remove it later: df <- data.frame(placeholder = logical(length(line[[1]]))). There may be other more elegant ways to handle this.
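Both points can be put together in a short sketch. The values of `line` and `col_names` below are made up for illustration (the question does not show them), and the second part shows one possible way to get a data frame with headers and no rows.

```r
# Hypothetical stand-ins for the question's objects
line <- "ENSG01, 12, 7"
col_names <- c("gene_id", "count_a", "count_b")

values <- strsplit(line, ", ")[[1]]

# Option 1: start from a one-row placeholder and fill columns by name.
# df[[name]] uses the value stored in `name`, unlike df$name.
df <- data.frame(placeholder = logical(1))
for (x in seq_along(col_names)) {
  df[[col_names[x]]] <- values[x]
}
df$placeholder <- NULL  # drop the placeholder once real columns exist

# Option 2: a data frame with column headers only and zero rows
empty_df <- data.frame(matrix(nrow = 0, ncol = length(col_names)))
names(empty_df) <- col_names
```

Note that every value in `df` is still a character string here; converting the count columns with as.numeric() would be a separate step.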
I have imported some Twitter data which gives me a list with a tibble for every user. Each tibble has 11 columns and a varying number of rows, depending on how many lists a Twitter user has.
If a Twitter user has no lists, it is listed as a data frame with 0 rows and 0 columns (see [3] in the picture). I don't want to delete such entries but keep them as a user with no lists.
Hence, I'm thinking whether I can create a tibble with 11 columns and 1 row where each cell contains a "99".
How do I change a data frame within a list to a tibble?
Thanks a lot for your help!
You can try:
#get index of dataframes that have 0 columns
inds <- lengths(list_data_outlier) == 0
#get column names from other dataframe which is not empty
cols <- names(list_data_outlier[[which.max(!inds)]])
#create an empty dataframe with data as 99 and 1 row
empty_df <- data.frame(matrix(99, nrow = 1, ncol = length(cols),
dimnames = list(NULL, cols)))
#replace the dataframes with 0 columns with empty_df
list_data_outlier[inds] <- replicate(sum(inds), empty_df, simplify = FALSE)
Thanks again, @Ronak Shah!
This worked well:
inds <- lengths(list_data_outlier) == 0
empty_df <- list_data_outlier[[which.max(!inds)]][1, ]
list_data_outlier[inds] <- replicate(sum(inds), empty_df, simplify = FALSE)
But instead of choosing the first row, and hence having wrong data in the data frame, I used the 50th row:
empty_df <- list_data_outlier[[which.max(!inds)]][50, ]
The row number has to be larger than the number of rows in that tibble (for example, nrow + 1), so that the index points past the last row.
That way you'll get a tibble with 1 row and the same number and types of columns as in the rest of your list, but instead of being filled with "wrong" data it's filled with NAs, which is what I needed to continue with my analysis.
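The same idea can be sketched without hard-coding a row number, by using NA_integer_ as the row index. This is an illustration on base data frames with a made-up list (`list_data` is hypothetical, standing in for list_data_outlier); tibbles may behave slightly differently for out-of-range indices.

```r
# Hypothetical list standing in for list_data_outlier
list_data <- list(
  data.frame(id = 1:2, name = c("a", "b"), stringsAsFactors = FALSE),
  data.frame(),
  data.frame(id = 3L, name = "c", stringsAsFactors = FALSE)
)

# TRUE for entries with 0 columns
inds <- lengths(list_data) == 0

# Take the first non-empty entry as a template for column names and types
template <- list_data[[which.max(!inds)]]

# An NA row index returns one row of NAs with the template's column types
na_row <- template[NA_integer_, , drop = FALSE]
rownames(na_row) <- NULL

# Replace every empty entry with the NA row
list_data[inds] <- replicate(sum(inds), na_row, simplify = FALSE)
```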
I am trying to subset this data frame by pre determined row numbers.
# Make dummy data frame
df <- data.frame(data=1:200)
train.length <- 1:2
# Set pre determined row numbers for subsetting
train.length.1 = 1:50
test.length.1 = 50:100
train.length.2 = 50:100
test.length.2 = 100:150
train.list <- list()
test.list <- list()
# Loop for subsetting by row, using row numbers in variables above
for (i in 1:length(train.length)) {
# subset by row number, each row number in variables train.length.1,2etc..
train.list[[i]] <- df[train.length.[i],] # need to place the variable train.length.n here...
test.list[[i]] <- df[test.length.[i],] # place test.length.n variable here..
# save outcome to lists
}
My question is: if I have my row numbers stored in a variable, how do I place each [ith] one inside the subsetting code?
I have tried:
df[train.length.[i],]
also
df[paste0"train.length.",[i],]
However, that pastes as a character string and it doesn't read my train.length.n variable, as shown below:
> train.list[[i]] <- df[c(paste0("train.length.",train.length[i])),]
> train.list
[[1]]
data data1
NA NA NA
If I put the variable in there by itself, it works as intended. I just need it to work in a for loop.
Desired output - print those below
train.set.output.1 <- df[train.length.1,]
test.set.output.1 <- df[test.length.1,]
train.set.output.2 <- df[train.length.2,]
test.set.output.2 <- df[test.length.2,]
I can do this manually, but it's cumbersome for lots of train / test sets, hence the for loop.
Consider staggered seq() and pass the number sequences in lapply to slice by rows. Also, for equal-length dataframes, you likely intended starts at 1, 51, 101, ...
train_num_set <- seq(1, 200, by=50)
train.list <- lapply(train_num_set, function(i) df[c(i:(i+49)),])
test_num_set <- seq(51, 200, by=50)
test.list <- lapply(test_num_set, function(i) df[c(i:(i+49)),])
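If the row ranges are not evenly spaced, another option is to keep the index vectors themselves in a list, so that position i in the list replaces the numbered variable names. The sketch below uses non-overlapping ranges as an assumption (the question's ranges overlap at rows 50 and 100), and the names `train.idx` and `test.idx` are illustrative. The last lines show get(), which looks a variable up by its name when separate variables must be kept.

```r
df <- data.frame(data = 1:200)

# Index vectors stored in lists instead of train.length.1, train.length.2, ...
train.idx <- list(1:50, 51:100)
test.idx  <- list(51:100, 101:150)

# drop = FALSE keeps the result a data frame even though df has one column
train.list <- lapply(train.idx, function(rows) df[rows, , drop = FALSE])
test.list  <- lapply(test.idx,  function(rows) df[rows, , drop = FALSE])

# If the separate variables must stay, get() resolves the pasted name
train.length.1 <- 1:50
first_train <- df[get(paste0("train.length.", 1)), , drop = FALSE]
```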
Create a function that splits your data frame into different chunks:
split_frame_by_chunks <- function(data_frame, chunk_size) {
n <- nrow(data_frame)
r <- rep(1:ceiling(n/chunk_size),each=chunk_size)[1:n]
sub_frames <- split(data_frame,r)
return(sub_frames)
}
Call your function using your data frame and chunk size. In your case, you are splitting your data frame df into chunks of 50:
chunked_frames <- split_frame_by_chunks(df, 50)
Decide the number of train/test splits to create in the loop:
num_splits <- 2
Create the appropriate train and test sets inside your loop. In this case, I am creating the 2 you showed in your question (i.e. the first iteration creates a train and test set with rows 1-50 and 51-100 respectively):
for(i in 1:num_splits) {
  # [[ ]] extracts the data frame itself; [ ] would return a list of length 1
  this_train <- chunked_frames[[i]]
  this_test <- chunked_frames[[i+1]]
}
Just do whatever you need to the dynamically created train and test frames inside your loop.
search <- function(x,max_hp){
count <- 1
result <- matrix(NA, nrow =nrow(x), ncol = ncol(x))
for(i in 1:nrow(x)){
temp_row <- x[i,]
if(temp_row[4] < max_hp){
result[count,] <- temp_row
count <- count + 1
}
}
return(result)
}
I want to search the rows of the mtcars data frame in R that have hp > 240 using a for loop (iterating over each row of the data frame) and then return only the ones that match. But my code doesn't work. I want to store each matched row in an empty matrix.
I have too few points to comment, but I have a couple of points to share. First, I agree with @Otto Kässi or @seeellayewhy. I would just add that if you don't want any NAs in mtcars$hp to remain in your result, you need to use
result <- mtcars[which(mtcars$hp > 240),]
Regarding substituting rows, I would just follow the above command with
result <- rbind(result,newrows)
R will complain if any attributes of the columns in newrows are different than in result, especially if any of your columns are factor data types with any difference in the levels defined.
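To illustrate why which() matters when NAs are present, here is a small sketch. The vector `hp_with_na` is made up for the demonstration, since mtcars$hp itself contains no missing values.

```r
# Made-up vector with a missing value
hp_with_na <- c(110, NA, 264, 335)

# Plain logical indexing keeps an NA element where the comparison is NA
kept_na <- hp_with_na[hp_with_na > 240]

# which() returns only the positions that are TRUE, so the NA is dropped
no_na <- hp_with_na[which(hp_with_na > 240)]

# The same pattern applied to the rows of mtcars
result <- mtcars[which(mtcars$hp > 240), ]
```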
I am trying to populate a data frame from within a for loop in R. The names of the columns are generated dynamically within the loop and the value of some of the loop variables is used as the values while populating the data frame. For instance the name of the current column could be some variable name as a string in the loop, and the column can take the value of the current iterator as its value in the data frame.
I tried to create an empty data frame outside the loop, like this
d = data.frame()
But I can't really do anything with it; the moment I try to populate it, I run into an error:
d[1] = c(1,2)
Error in `[<-.data.frame`(`*tmp*`, 1, value = c(1, 2)) :
replacement has 2 rows, data has 0
What may be a good way to achieve what I am looking to do? Please let me know if I wasn't clear.
It is often preferable to avoid loops and use vectorized functions. If that is not possible there are two approaches:
Preallocate your data.frame. This is not recommended because indexing is slow for data.frames.
Use another data structure in the loop and transform into a data.frame afterwards. A list is very useful here.
Example to illustrate the general approach:
mylist <- list() #create an empty list
for (i in 1:5) {
vec <- numeric(5) #preallocate a numeric vector
for (j in 1:5) { #fill the vector
vec[j] <- i^j
}
mylist[[i]] <- vec #put all vectors in the list
}
df <- do.call("rbind",mylist) #combine all vectors into a matrix
In this example it is not necessary to use a list, you could preallocate a matrix. However, if you do not know how many iterations your loop will need, you should use a list.
Finally here is a vectorized alternative to the example loop:
outer(1:5,1:5,function(i,j) i^j)
As you see it's simpler and also more efficient.
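For the case in the question, where column names are generated inside the loop, the list approach could look like the sketch below. The naming scheme `var_<i>` and the values are hypothetical, chosen only to show the pattern.

```r
cols <- list()  # one list element per future column

for (i in 1:3) {
  col_name <- paste0("var_", i)  # hypothetical dynamically generated name
  cols[[col_name]] <- i * c(1, 2)  # the values depend on the iterator
}

# All elements have the same length, so the list converts directly
d <- as.data.frame(cols)
```

The conversion happens once, after the loop, so the data frame is never indexed inside it.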
You could do it like this:
iterations = 10
variables = 2
output <- matrix(ncol=variables, nrow=iterations)
for(i in 1:iterations){
output[i,] <- runif(2)
}
output
and then turn it into a data.frame
output <- data.frame(output)
class(output)
What this does:
Creates a matrix with rows and columns according to the expected growth.
Inserts 2 random numbers into each row of the matrix.
Converts it into a data frame after the loop has finished.
This works too.
df = NULL
for (k in 1:10)
{
x = 1
y = 2
z = 3
df = rbind(df, data.frame(x,y,z))
}
The output will look like this (10 identical rows, since the loop runs 10 times; the first two are shown):
head(df, 2)
#   x y z
# 1 1 2 3
# 2 1 2 3
Thanks Notable1, works for me with the tidytextr
Create a dataframe with the name of files in one column and content in other.
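# read_pdf() is assumed here to come from the textreadr package
library(textreadr)
diretorio <- "D:/base"
# list.files() takes a regular expression, not a glob
arquivos <- list.files(diretorio, pattern = "\\.PDF$")
quantidade <- length(arquivos)
#
df = NULL
for (k in seq_len(quantidade)) {
  nome = arquivos[k]
  print(nome)
  Sys.sleep(1)
  # build the full path so the file is found regardless of the working directory
  dados = read_pdf(file.path(diretorio, nome), ocr = TRUE)
  print(dados)
  Sys.sleep(1)
  df = rbind(df, data.frame(nome, dados))
  Sys.sleep(1)
}
Encoding(df$text) <- "UTF-8"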
I had a case where I needed to use a data frame within a for loop. In this case it was the "efficient" solution; keep in mind, however, that the database was small and the iterations in the loop were very simple. But maybe the code could be useful for someone with similar conditions.
The purpose of the for loop was to use the raster extract function across five locations (i.e. Tokyo, New York, São Paulo, Seoul and Mexico City), each of which had its respective raster grids. I had a spatial point database with more than 1000 observations allocated within the 5 different locations, and I needed to extract information from 10 different raster grids (two grids per location). Also, for the subsequent analysis, I needed not only the raster values but also the unique ID for each observation.
Preparing the spatial data included the following tasks:
Import points shapefile with the readOGR function (rgdal package)
Import raster files with the raster function (raster package)
Stack grids from the same location into one file, with the function stack (raster package)
Here is the for loop code using a data frame:
1. Add stacked rasters per location into a list
raslist <- list(LOC1,LOC2,LOC3,LOC4,LOC5)
2. Create an empty dataframe, this will be the output file
TB <- data.frame(VAR1=double(),VAR2=double(),ID=character())
3. Set up for loop function
L1 <- seq(1,5,1) # the location ID is a numeric variable with values from 1 to 5
for (i in 1:length(L1)) {
dat=subset(points,LOCATION==i) # select corresponding points for location [i]
t=data.frame(extract(raslist[[i]],dat),dat$ID) # run extract function with points & raster stack for location [i]
names(t)=c("VAR1","VAR2","ID")
TB=rbind(TB,t)
}
I was looking for the same, and the following may be useful as well.
a <- vector("list", 1)
for(i in 1:3){a[[i]] <- data.frame(x= rnorm(2), y= runif(2))}
a
rbind(a[[1]], a[[2]], a[[3]])
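Writing out each list element by hand does not scale past a few iterations. A sketch of the same pattern with do.call(), which passes every element of the list to rbind() regardless of how many there are (the `iter` column is an added illustration, not part of the snippet above):

```r
set.seed(1)

# Preallocate a list with one slot per iteration
frames <- vector("list", 3)

for (i in seq_along(frames)) {
  # iter is recycled to match the two rows produced per iteration
  frames[[i]] <- data.frame(iter = i, x = rnorm(2), y = runif(2))
}

# Bind all data frames in the list in a single call
combined <- do.call(rbind, frames)
```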