Splitting a data frame in R into two

I have a data frame with a column 'Date2' that holds the day of the month of another column 'Date', modulo 5. I want to split the data into two data frames: one containing all rows where the modulus is 0, and a second containing all the others.
The code below works on this reproducible example, but since I have to apply it to a large dataset, I want to know whether it is an appropriate way to do this.
Here is my code:
library(dplyr)
library(lubridate)
DD <- seq(as.Date("2019/01/01"), by = "day", length.out = 31) # creating data for df
DD2 <- data.frame(Date = DD, var = 1:31) # reproducible df for testing
DD2 <- DD2 %>%
  mutate(Date2 = mday(Date) %% 5) # day of month modulo 5, stored in Date2
DD2
D3 <- split(DD2, DD2$Date2 == 0) # list of two data frames: "FALSE" (remainder != 0) and "TRUE" (remainder == 0)
D4 <- split(DD2, DD2$Date2 != 0) # the same two groups, with the labels swapped
D3
D4

If we need to split, we can use group_split, which returns a list of two data frames:
DD2 %>%
  group_split(new = as.integer(!mday(Date) %% 5))
Or with split, where a single call already produces both pieces:
DD3 <- split(DD2, DD2$Date2 == 0)
and then extract the elements with [[
DD3[["TRUE"]]
DD3[["FALSE"]]
Alternatively, the flag can be kept as a grouping variable instead of splitting into multiple datasets:
DD2 %>%
  mutate(grp = as.integer(!mday(Date) %% 5))
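As a quick check of the split approach, here is a minimal base R sketch. It assumes the same 31-day January example and swaps in as.POSIXlt(...)$mday for lubridate's mday() so that it runs without extra packages. The names multiples_of_5 and others are made up for illustration.

```r
# Base R sketch: one split() call yields both data frames.
# as.POSIXlt(x)$mday is used here as a stand-in for lubridate::mday(x).
dates <- seq(as.Date("2019-01-01"), by = "day", length.out = 31)
df <- data.frame(Date = dates, var = 1:31)
df$Date2 <- as.POSIXlt(df$Date)$mday %% 5

parts <- split(df, df$Date2 == 0)      # list with elements "FALSE" and "TRUE"
multiples_of_5 <- parts[["TRUE"]]      # days 5, 10, 15, 20, 25, 30
others <- parts[["FALSE"]]             # every other day of the month

nrow(multiples_of_5)  # 6
nrow(others)          # 25
```

Because both pieces come out of one call, there is no need to run split a second time with the negated condition.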

Related

How to split a dataframe into a list keeping the previous dataframe in R?

This is a minimal reproducible example of my data frame:
library(dplyr)
value <- rnorm(39, 5, 2)
Date <- seq(as.POSIXct('2021-01-18'), as.POSIXct('2021-10-15'), by = "7 days")
df <- data.frame(Date, value)
# This is the vector I have to compare with the Date of the data frame
dates_tour <- as.POSIXct(c('2021-01-18', '2021-05-18', '2021-08-18', '2021-10-15'))
df <- df %>%
  mutate(
    tour = cut(Date, breaks = dates_tour, labels = seq_along(dates_tour[-1]))
  )
Now that each row is labelled with a tour group based on dates_tour, I want to split the data frame by the tour factor, but each list element must also contain the rows of all previous groups.
For instance, df_list[[1]] contains the rows with tour == 1. The second element needs to contain the first and second groups (tour == 1 | tour == 2), the third needs to contain the first, second, and third groups, and so on. The code has to be general, because dates_tour can have a different number of values each time.
This code creates a list based on the tour value
df_list = split(df, df$tour)
but it does not produce the cumulative structure I need.
You could also do:
Reduce(rbind, split(df, ~tour), accumulate = TRUE)
If you have a version of R older than 4.1.0, where split does not accept a formula:
Reduce(rbind, split(df, df$tour), accumulate = TRUE)
You could also use accumulate from purrr:
library(purrr)
accumulate(split(df, ~tour), rbind)
We may also loop over the tour values with lapply:
df_list <- lapply(unique(df$tour), function(x) subset(df, tour %in% seq_len(x)))
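To see what the cumulative list looks like, here is a small base R sketch of the Reduce approach. The six-row data frame and its id column are made up for illustration and stand in for the real data.

```r
# Toy data frame (hypothetical values): three tour groups of two rows each.
df <- data.frame(id = 1:6, tour = factor(c(1, 1, 2, 2, 3, 3)))

pieces <- split(df, df$tour)                       # one data frame per tour
cumulative <- Reduce(rbind, pieces, accumulate = TRUE)

# Element k holds the rows of tours 1..k.
vapply(cumulative, nrow, integer(1))  # 2 4 6
```

The number of list elements follows the number of tour levels, so the same code works when dates_tour has a different length.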

R: extract the last two numbers in a variable

I have two datasets (data1 and data2).
Data1 has a column (one of many) named B23333391.
Data2 has a column called id_number, where id numbers are listed (e.g. 344444491).
I need to extract the last two digits (91) from the variable in data1 and match them against the last two digits of the id number in column id_number of data2,
since the last two digits represent an individual.
E.g.:
Data1:
columns: -> B23333391..... and so on
Data2:
columns: -> id_number
344444491
and so on....
How can this be done?
Thanks in advance!
Try this approach. You can use a dplyr pipeline to build an id variable in both data frames with substr(), using nchar() to locate the last two characters. After that you can merge using left_join(). Here is the code, with simulated data similar to what you shared:
library(dplyr)
# Data
df1 <- data.frame(Var1 = c('B23333391'), Val1 = 1, stringsAsFactors = FALSE)
df2 <- data.frame(Varid = c('344444491'), Val2 = 1, stringsAsFactors = FALSE)
# Merge on the last two characters of each identifier
dfnew <- df1 %>%
  mutate(id = substr(Var1, nchar(Var1) - 1, nchar(Var1))) %>%
  left_join(df2 %>% mutate(id = substr(Varid, nchar(Varid) - 1, nchar(Varid))),
            by = "id")
Output:
       Var1 Val1 id     Varid Val2
1 B23333391    1 91 344444491    1
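The same idea can be sketched in base R with merge(). The helper last_two and the second row of each data frame are hypothetical additions for illustration. They are not part of the original data.

```r
# Hypothetical helper: return the last two characters of each string.
last_two <- function(x) substr(x, nchar(x) - 1, nchar(x))

# Simulated data; the second row in each data frame is made up.
df1 <- data.frame(Var1 = c("B23333391", "B23333342"), Val1 = 1:2,
                  stringsAsFactors = FALSE)
df2 <- data.frame(Varid = c("344444491", "344444442"), Val2 = c(10, 20),
                  stringsAsFactors = FALSE)

df1$id <- last_two(df1$Var1)
df2$id <- last_two(df2$Varid)

# Left join on the two-character id.
merged <- merge(df1, df2, by = "id", all.x = TRUE)
merged
```

If the identifiers are stored as numbers, they would need as.character() first, since substr() and nchar() operate on strings.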

How can I create subsets from this data frame?

I want to aggregate my data so that each time interval gives one point in a diagram. I have a data frame with two columns: the first is a timestamp and the second is a value. I want to evaluate each time period, meaning that all values within a period (for example, 1 second) are added together.
I don't know how to do this with the aggregate function, because it does not seem to support time values.
0.000180 8
0.000185 8
0.000474 32
It is not easy to tell from your question what you're specifically trying to do. Your data has no column headings, we do not know the data types, you did not include the error message, and your original question and your comment contradict each other (is the first column the timestamp, or the second?).
I'm trying to understand. Are you trying to:
Split your original data.frame into multiple data.frames?
View a specific subset of your data, that is, filter it?
Group your data.frame into increments of a set time interval and then aggregate the results?
Assuming that you have named the variables in your data frame time and value, I've addressed these three cases below.
# Set data
num <- 100
set.seed(4444)
tempdf <- data.frame(time = sample(seq(0.000180, 0.000500, 0.000005), num, TRUE),
                     value = sample(1:100, num, TRUE))
# Example 1: Split your data into multiple data frames (using base functions)
temp1 <- tempdf[tempdf$time > 0.0003, ]
temp2 <- tempdf[tempdf$time > 0.0003 & tempdf$time < 0.0004, ]
# Example 2: Filter your data (using the dplyr::filter() function)
dplyr::filter(tempdf, time > 0.0003 & time < 0.0004)
# Example 3: Chain the functions together using dplyr to group and summarise your data
library(dplyr)
tempdf %>%
  mutate(group = floor(time * 10000) / 10000) %>%
  group_by(group) %>%
  summarise(avg = mean(value),
            num = n())
I hope that helps?
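Since the question asks for the values to be added together per interval, here is a minimal base R sketch that sums instead of averaging. It uses the three rows shown in the question and assumes the columns are named time and value. The bin_width of 0.0001 is an arbitrary choice for illustration. The round() call is a guard against floating-point error before floor().

```r
# The three sample rows from the question; column names are assumed.
tempdf <- data.frame(time = c(0.000180, 0.000185, 0.000474),
                     value = c(8, 8, 32))

bin_width <- 0.0001  # hypothetical interval length

# Integer bin index per row; round() avoids results like 2.9999999 flooring to 2.
tempdf$bin <- floor(round(tempdf$time / bin_width, 6))

# Sum the values within each bin.
totals <- aggregate(value ~ bin, data = tempdf, FUN = sum)
totals
#   bin value
# 1   1    16
# 2   4    32
```

Multiplying the bin index by bin_width gives the start time of each interval if that is needed for plotting.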

R: Fill up data frame with data and match values to the right date

My real situation is messy. I have 60 separate tables that contain data between 2009-01-01 and 2017-09-30, but the values are not continuous for each day. Some tables have data for one day per month, others at three-day intervals, and sometimes there is a value for every day of a month.
I want to find out for which date in each year the frequency of data is highest. I need this for interpolation afterwards.
My idea: I build a data frame whose first column holds the dates from 2009-01-01 to 2017-09-30 continuously. Then I want to fill this data frame with the 60 tables, where the data is not continuous.
I need code that matches the data to the right date inside the data frame WholeData (see example). I don't need the date columns of the individual tables afterwards, because the dates are already in the first column.
Example code simplified:
df1 <- sample(seq(as.Date('2009-01-01'), as.Date('2009-09-30'), by = "day"),
12)
df1 <- sort(df1)
expenses1 <- sample(180, 12)
df1 <- data.frame(df1, expenses1)
df2 <- sample(seq(as.Date('2009-01-01'), as.Date('2009-09-30'), by = "day"),
12)
df2 <- sort(df2)
expenses2 <- sample(180, 12)
df2 <- data.frame(df2, expenses2)
WholeData <- seq(as.Date("2009-01-01"), by = 1, as.Date("2009-09-30"))
df <- data.frame(WholeData)
df1 and df2 stand in for my 60 messy tables. The time interval is reduced, too.
First of all, I would recommend organizing all your data frames into a list:
data_list <- list(df, df1, df2)
Keeping the data frames in a list matters because it opens the door to more advanced (and scalable!) approaches.
Besides, it makes sense to give the same name to every column that contains the date values:
for (i in seq_along(data_list)) {
  colnames(data_list[[i]])[1] <- "date"
}
The "date" column will be the key column for joining the data frames.
Now that preprocessing is done, you can build the final data frame using any of the methods below.
# with base R
res_1 <- Reduce(function(dtf1, dtf2) merge(dtf1, dtf2, by = "date", all.x = TRUE),
                data_list)
# using tidyverse tools
library(tidyverse)
# with the purrr package
res_2 <- data_list %>% purrr::reduce(full_join, by = "date")
# with the dplyr package
res_3 <- data_list %>%
  Reduce(function(dtf1, dtf2) dplyr::full_join(dtf1, dtf2, by = "date"), .)
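The base R variant can be tried on a tiny example. The sketch below uses a five-day calendar and two made-up tables (t1 and t2 with hypothetical values). The n_values column is an assumption about what might help with the frequency question, not something from the answer above.

```r
# Continuous calendar plus two sparse tables (hypothetical values).
dates <- seq(as.Date("2009-01-01"), as.Date("2009-01-05"), by = "day")
whole <- data.frame(date = dates)
t1 <- data.frame(date = dates[c(1, 3)], expenses1 = c(10, 30))
t2 <- data.frame(date = dates[c(2, 3, 5)], expenses2 = c(5, 15, 25))

# Left-join every table onto the calendar; missing days become NA.
filled <- Reduce(function(a, b) merge(a, b, by = "date", all.x = TRUE),
                 list(whole, t1, t2))

# Count how many tables have a value on each date.
filled$n_values <- rowSums(!is.na(filled[, -1]))
filled
```

With all.x = TRUE the calendar must be the first element of the list, so that every date is kept even when no table has a value for it.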

Split a large file into equal rows in R

Here is my sample data
library(dplyr)
Singer <- c("A","B","C","A","B","D")
Rank <- c(1,2,3,3,2,1)
data <- tibble(Singer, Rank)
I would like to split the data into three separate csv files, each with two rows. I tried the split function, but it did not work out as I expected.
d <- split(data,rep(1:2,each=2))
Group first, then use do to apply the writing function to each pair of rows.
library(dplyr)
library(readr)
data %>%
  group_by(g = ceiling(row_number() / 2)) %>%
  do(write_csv(., paste0(.$g[1], '.csv')))
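A base R sketch of the same grouping idea follows, assuming a plain data.frame and writing into a temporary directory. The names rows_per_file, out_dir, and paths are made up for illustration.

```r
data <- data.frame(Singer = c("A", "B", "C", "A", "B", "D"),
                   Rank = c(1, 2, 3, 3, 2, 1))

rows_per_file <- 2  # hypothetical chunk size

# Chunk index per row: 1 1 2 2 3 3
chunk_id <- ceiling(seq_len(nrow(data)) / rows_per_file)
chunks <- split(data, chunk_id)

# Write each chunk to its own file (tempdir() is used to keep the sketch self-contained).
out_dir <- tempdir()
paths <- file.path(out_dir, paste0(names(chunks), ".csv"))
for (i in seq_along(chunks)) {
  write.csv(chunks[[i]], paths[i], row.names = FALSE)
}
```

The grouping vector has to be as long as the data, which is why rep(1:2, each = 2) in the question does not give three groups of two rows.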
