Trying to add multiple sequentially numbered columns to a data frame in R

I need to add 7 empty columns (to represent days of the week) to an existing data frame, and it would be especially helpful if their names could be preceded by the word "Day".
I have previously used 7 lines like this:
DF$'Day 1' <- ''
DF$'Day 2' <- ''
Is it possible to shorten this, possibly using a loop?
e.g. for(i in 1:7) {DF$'Day [i]' <- ''}
This obviously doesn't work, otherwise I wouldn't be asking.

If you attempt to assign to non-existent columns then they just get created for you automagically.
DF <- data.frame(x = 1:4, y = 'hi')
days <- paste0('Day',1:7)
DF[,days] <- NA
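Note that paste0('Day', 1:7) produces names such as "Day1" (no space) and the columns above are filled with NA. If the names should read "Day 1" and the cells should hold empty strings, as in the question, a minimal sketch could look like the following. The second data frame, DF2, is a hypothetical name used only to show the loop form the question asked about.

```r
# Example data frame, same shape as in the answer above
DF <- data.frame(x = 1:4, y = "hi")

# paste() inserts a space by default, giving "Day 1", "Day 2", ...
days <- paste("Day", 1:7)

# Vectorised: create all seven columns at once, filled with empty strings
DF[days] <- ""

# Loop form: [[ ]] accepts a computed column name, which $ does not
DF2 <- data.frame(x = 1:4, y = "hi")
for (i in 1:7) {
  DF2[[paste("Day", i)]] <- ""
}
```

Both forms should give the same column names; the vectorised one is usually preferred.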

If you need to create an empty data frame then you can do the following:
DF <- as.data.frame(matrix(nrow=0, ncol=20))
names(DF) <- paste("Val", 1:20, sep="")

Related

Dividing one dataframe into many with names in R

I have some large data frames that are big enough to push the limits of R on my machine; e.g., the one on which I'm currently working is 2 columns by 70 million rows. The contents aren't important, but just in case, column 1 is a string and column 2 is an integer.
What I would like to do is split that data frame into n parts (say, 20, but preferably something that could change on a case-by-case basis) so that I can work on each of the smaller data frames one at a time. That means that (a) the result has to produce things that are named (e.g., "newdf_1", "newdf_2", ... "newdf_20" or something), and (b) each line in the original data frame needs to be in one (and only one) of the new "sub" data frames. The order does not matter, but doing it sequentially by rows makes sense to me.
Once I do the work, I will start to recombine them (using rbind()) one pair at a time.
I've looked at split(), but from what I can tell, it is designed to work with factors (which I don't have).
Any ideas?
You can create a new column and split the data frame based on that column. The column does not need to be a factor, but it needs to be a data type that the split function can convert to a factor.
# Number of groups
N <- 20
dat$group <- 1:nrow(dat) %% N
# Add 1 to group
dat$group <- dat$group + 1
# Split the dat by group
dat_list <- split(dat, f = ~group)
# Set the name of the list
names(dat_list) <- paste0("newdf_", 1:N)
Data
set.seed(123)
# Create example data frame
dat <- data.frame(
A = sample(letters, size = 70000000, replace = TRUE),
B = rpois(70000000, lambda = 1)
)
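The modulo grouping above interleaves rows across the groups. Since the question mentions that splitting sequentially by rows would make sense, here is a minimal sketch of that variant on a small stand-in data frame. The names chunk_size and chunk_id are assumptions made for illustration, and the 10-row data is not the asker's data.

```r
# Small stand-in for the large data frame (hypothetical data)
dat <- data.frame(A = letters[1:10], B = 1:10)

# Desired number of parts
N <- 3

# Consecutive rows go to the same chunk: rows 1-4, 5-8, 9-10
chunk_size <- ceiling(nrow(dat) / N)
chunk_id <- ceiling(seq_len(nrow(dat)) / chunk_size)

# split() converts the numeric ids to a factor internally
dat_list <- split(dat, chunk_id)
names(dat_list) <- paste0("newdf_", seq_along(dat_list))

# Recombine all parts once the per-chunk work is done
recombined <- do.call(rbind, dat_list)
```

Keeping the parts in one named list (dat_list$newdf_1 and so on) avoids creating many separate objects in the workspace.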
Here's a tidyverse based solution. Try using read_csv_chunked().
library(tidyverse)
# practice data
tibble(string = sample(letters, 1e6, replace = TRUE),
       value = rnorm(1e6)) %>%
  write_csv("test.csv")
# here's the solution: keep only the matching rows of each chunk
partial_data <- read_csv_chunked("test.csv",
  DataFrameCallback$new(function(x, pos) filter(x, string == "a")),
  chunk_size = 1000)
You can wrap the call to read_csv_chunked in a function where you change the string that you subset on.
This is more or less a repeat of this question:
How to read only lines that fulfil a condition from a csv into R?

Creating multiple dataframes in a loop in R

I am new to R and I don't know how to create multiple data frames in a loop. For example:
I have a data frame "Data" with 20 rows and 4 columns:
Data <- data.frame(matrix(NA, nrow = 20, ncol = 4))
names(Data) <- c("A","B","C","D")
I want to choose the rows of Data whose values in column T are closest to the elements of the vector X.
X = c(X1,X2,X3,X4,X5)
Finally, I want to assign them to separate data frames named after their associated X element:
for(i in 1:length(X)){
data_X[i] <- data.frame(matrix(NA))
data_X[i] <- subset(data2, 0 <= A-X[i] | A-X[i]< 0.000001 )
}
Thank you!
Since you didn't give us any numbers, it is difficult to say exactly what your for loop needs to look for, so you will have to sort that part out yourself. Here is a basic example of what you could do. The important part that I think you are missing is that you need assign() to send the created data frames to your global environment (or wherever you want them to go). paste0() is a handy way to give each one its own name. Note that some of the data frames may be empty; it may be worthwhile to add an if statement that skips the assignment when nrow(data3) == 0.
Data <- data.frame(matrix(sample(1:10,80,replace = T), nrow = 20, ncol = 4))
names(Data) <- c("A","B","C","D")
X = c(1:10)
for(i in 1:length(X)){
  data2 <- Data
  data3 <- subset(data2, A == X[i])
  assign(paste0("SubsetData",i), data3, envir = .GlobalEnv)
}
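As an alternative to assign(), the subsets can be kept together in one named list. The following is only a sketch of that approach, reusing the example data above; the name subsets is an assumption, and the exact-match condition on column A is the same placeholder condition used in the answer, not the asker's "closest value" rule.

```r
set.seed(42)
Data <- data.frame(matrix(sample(1:10, 80, replace = TRUE), nrow = 20, ncol = 4))
names(Data) <- c("A", "B", "C", "D")
X <- 1:10

# One data frame per element of X, stored in a list instead of the global environment
subsets <- lapply(X, function(x) Data[Data$A == x, , drop = FALSE])
names(subsets) <- paste0("SubsetData", X)

# Drop the empty data frames
subsets <- subsets[sapply(subsets, nrow) > 0]
```

Individual results are then available as, for example, subsets[["SubsetData3"]] (if that subset is not empty), and the whole collection can be processed with lapply().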

Can't reorder data frame columns by matching column names given in another column

I'm trying to re-order the variables of my data frame using the contents of a variable in another data frame but it's not working and I don't know why.
Any help would be appreciated!
# Starting point
df_main <- data.frame(coat=c(1:5),hanger=c(1:5),book=c(1:5),
bottle=c(1:5),wall=c(1:5))
df_order <- data.frame(order_var=c("wall","book","hanger","coat","bottle"),
number_var=c(1:5))
# Goal
df_goal <- data.frame(wall=c(1:5),book=c(1:5),hanger=c(1:5),
coat=c(1:5),bottle=c(1:5))
# Attempt
df_attempt <- df_main[df_order$order_var]
In your df_order, put stringsAsFactors = FALSE in the data.frame call.
The issue is that order_var is stored as a factor, and indexing with a factor uses its integer codes rather than its labels. If you convert it to character it will work:
df_goal <- df_main[as.character(df_order$order_var)]
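To see the difference, the sketch below forces order_var to be a factor with stringsAsFactors = TRUE (the default before R 4.0.0) and compares the two ways of indexing. The names by_codes and by_labels are assumptions made for illustration.

```r
df_main <- data.frame(coat = 1:5, hanger = 1:5, book = 1:5,
                      bottle = 1:5, wall = 1:5)

# Force the factor behaviour explicitly so the example is reproducible
df_order <- data.frame(order_var = c("wall", "book", "hanger", "coat", "bottle"),
                       number_var = 1:5,
                       stringsAsFactors = TRUE)

# Indexing with the factor uses its integer codes (alphabetical level order)
by_codes <- names(df_main[df_order$order_var])

# Indexing with the labels gives the intended column order
by_labels <- names(df_main[as.character(df_order$order_var)])
```

Here by_labels follows the order given in df_order, while by_codes does not.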

R - Subset a Dataframe with a Programmatically built Formula

I'm working with a large data frame that is pulled from a data lake which I need to subset according to multiple different columns and run an analysis on. The basic subsettings come from an external Excel file which I read in and generate all possible combinations of. I want something to loop through each of these columns and subset my data accordingly.
A few of the subsettings follow a similar form to:
data_settings <- data.frame(country = rep(c('DE','RU','US','CA','BR'),6),
transport=rep(c('road','air','sea')),
category = rep(c('A','B')))
And my data lake extract has a form like:
df <- data.frame(country = rep(unique(data_settings$country),6),
transport = rep(unique(data_settings$transport),10),
category = rep(c('A','B'),15),
values = round(runif(30) * 10))
I need to subset the data according to each of the rows in my data_settings data frame, so I built a loop which constructs the formula according to what is in my data_settings data frame.
for(i in 1:nrow(data_settings)){
sub_string <- paste0(names(data_settings[1]), '==', data_settings[i,1])
for(j in 2:ncol(data_settings)){
col <- names(data_settings)[j]
val <- as.character(data_settings[i,j])
sub_string <- paste0(sub_string, ' & ', col," == ","'",val,"'")
}
df_sub <- subset(df, formula(sub_string))
}
This successfully builds my strings which I try to pass to formula or as.formula, but I receive an error at that point. I've tried a few different formulations without any success. In my actual case, there are thousands of combinations with different columns and values to filter against.
Thanks in advance for your help!
Try this:
merge(data_settings, df)
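By default, merge() joins on the columns the two data frames have in common, so merging a single settings row with df returns just the rows matching that combination. A minimal sketch of applying this per row, on small hypothetical data rather than the data from the question (the name subsets is also an assumption):

```r
# Hypothetical settings: one row per combination to filter on
data_settings <- data.frame(country   = c("DE", "RU"),
                            transport = c("road", "air"),
                            category  = c("A", "B"))

# Hypothetical extract with a values column
df <- data.frame(country   = c("DE", "DE", "RU", "US"),
                 transport = c("road", "sea", "air", "road"),
                 category  = c("A", "A", "B", "A"),
                 values    = c(1, 2, 3, 4))

# One subset per settings row; merge() matches on the shared column names
subsets <- lapply(seq_len(nrow(data_settings)), function(i) {
  merge(data_settings[i, , drop = FALSE], df)
})
```

Each element of subsets can then be passed to the analysis step in turn.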
I worked on my previous approach a bit more today without using subset, filter, etc., and put this together. It seems to do what I want well enough by filtering successively on each column in the data_settings frame.
for(i in 1:nrow(data_settings)){
df_sub <- df
for(j in 1:ncol(data_settings)){
col <- names(data_settings)[j]
val <- as.character(data_settings[i,j])
df_col <- grep(col, names(df))
df_sub <- df_sub[df_sub[,df_col] == val,]
}
# Run further analysis here...
}
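Regarding the original error: subset() expects a logical expression, not a formula, which is likely why formula(sub_string) fails. If building the condition as a string is still preferred, one possible sketch is to parse the string and evaluate it inside the data frame. The data and the name settings_row below are hypothetical, and every value is quoted, which the first column in the question's loop was not.

```r
# Hypothetical extract and a single settings row
df <- data.frame(country   = c("DE", "DE", "RU"),
                 transport = c("road", "sea", "air"),
                 category  = c("A", "A", "B"),
                 values    = c(1, 2, 3))
settings_row <- data.frame(country = "DE", transport = "road", category = "A")

# Build "col == 'val'" for every column, then join the pieces with &
vals <- vapply(settings_row[1, ], as.character, character(1))
conditions <- paste0(names(settings_row), " == '", vals, "'")
sub_string <- paste(conditions, collapse = " & ")

# Parse the string and evaluate it with df's columns in scope
keep <- eval(parse(text = sub_string), envir = df)
df_sub <- df[keep, ]
```

Evaluating parsed strings is fragile if values contain quotes, so the merge or column-by-column approaches above are generally safer.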

How to know if the value of a column is part of another column's value in R data.table

I have a data.table with a few customers, a day value, and a pay_day value.
pay_day is a vector of length 5 for each customer, and it consists of day values.
I want to check, for each day value, whether the day is part of the pay_day vector.
Here is some dummy data for this (pardon the messy way of creating it; I could not think of a better way at the moment):
customers <- c("179288" ,"146506" ,"202287","16207","152979","14421","41395","199103","183467","151902")
mdays <- 1:31
set.seed(1)
data <- sort(rep(customers,100))
days <- sample(mdays,1000,replace=T)
xyz <- cbind(data,days)
x <- vector(length=1000L)
j <- 1
for( i in 1:10){
set.seed(i) ## I wanted diff dates to be picked
m <- sample(mdays,5)
while(j <=100*i){
x[j] <- paste(m,collapse = ",")
j <- j+1
}
}
xyz <- cbind(xyz,x)
require(data.table)
my_data <- setDT(as.data.frame(xyz))
setnames(my_data, c("cust","days","pay_days"))
my_data[,pay:=runif(1000,min = 0,max=10000)]
Now I want, for each cust, the vector of pay values that fall on pay_days.
I have tried various ways but can't seem to figure it out. My initial thought is to create a flag based on whether days is a subset of pay_days, and then take the pay values according to the flag:
my_data[,ifelse(grepl(days,pay_days),1,0),cust]
This does not work as I expect it to. I don't want to use a native loop, as the actual data is huge.
Using tidyr to split the pay_days column into long form and then checking whether days is in pay_days:
library(tidyr)
library(dplyr)
# creating long-form data
tidier <- my_data %>%
mutate(pay_days = strsplit(as.character(pay_days), ",")) %>%
unnest(pay_days)
# casting as numeric to make factor & character columns comparable
tidier[, days := as.numeric(days)]
tidier[, pay_days := as.numeric(pay_days)]
tidier[days == pay_days, pay, by=cust]
Not sure how this performs for large data, as you multiply your table length by the number of days in pay_days...
Side note: I can't comment yet, but to replicate your data one needs to add library(data.table) and initialize x with x <- vector(), which is otherwise not found, as Dee also points out.
Another one-liner approach using the data table:
my_data[,result:=sum(unlist(lapply(strsplit(as.character(pay_days),","),match,days)),na.rm=T)>0,by=1:nrow(my_data)]
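For reference, the grepl attempt in the question has two problems: grepl() only uses the first element of its pattern argument, and a substring match would treat day "1" as present in "10,21". A minimal base R sketch of an exact, row-wise membership test follows. It uses a small hypothetical data frame rather than the data.table from the question, and the names pay_list and is_pay_day are assumptions.

```r
# Small hypothetical stand-in for my_data
my_data <- data.frame(cust     = c("a", "a", "b"),
                      days     = c("3", "7", "1"),
                      pay_days = c("3,10,21", "3,10,21", "10,21,5"),
                      stringsAsFactors = FALSE)

# Split each comma-separated string into a character vector of days
pay_list <- strsplit(my_data$pay_days, ",", fixed = TRUE)

# Exact membership test per row: is this row's day one of its pay days?
my_data$is_pay_day <- mapply(function(d, p) d %in% p,
                             my_data$days, pay_list,
                             USE.NAMES = FALSE)

# Pay-day rows only, grouped by customer
pay_rows <- split(my_data[my_data$is_pay_day, ], my_data$cust[my_data$is_pay_day])
```

The third row shows the substring pitfall: day "1" is correctly flagged FALSE even though "1" occurs inside "10" and "21".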
