Add multiple rows per observation - r

I'm trying to construct a panel dataset that can work as a "vessel" where I can put my real data.
I have information about 346 municipalities, and I want to add daily information for a total of 166 days, so each municipality should have 166 rows (one per day). I've only managed to get a dataset with 57,436 rows (which is 346*166), but I can't find a way to include both the name of the municipality and the date. It's one thing or the other. Any ideas on how I can do this? The code that I'm using so far, which produces 346 observations per day, is the following:
comunas_panel <- data.frame()
for(i in 1:nrow(codigos_territoriales)) {
dates <- data.frame(date = seq(from = as.Date("2019-10-18"),
to = as.Date("2020-03-31"), by = 1))
comunas_panel = rbind(comunas_panel, dates)
}

Try the expand.grid() function.
It takes any number of arguments that can be named whatever you choose. Each argument is a vector. The result is a data frame with columns named after the arguments, with all possible combinations of the elements from each of the vectors you input. So in this example I use your vector of 166 dates and cross it with a toy example of 3 municipality names to get a data frame with 166*3 = 498 rows and 2 columns (date and municipality).
date <- seq(from = as.Date("2019-10-18"), to = as.Date("2020-03-31"), by = 1)
municipalities <- c('name1', 'name2', 'name3') #etc.
comunas_panel <- expand.grid(municipality = municipalities, date = date)
Similar alternatives are expand_grid() in tidyr (part of the tidyverse) and CJ() in data.table.
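As a rough sketch of how the empty "vessel" could then be filled, the grid can be merged with the real data on both keys. The `real_data` frame and its `cases` column are hypothetical stand-ins, and the three municipality names are the same toy values as above.

```r
# Build the empty panel: every municipality crossed with every date
dates <- seq(from = as.Date("2019-10-18"), to = as.Date("2020-03-31"), by = 1)
municipalities <- c("name1", "name2", "name3")  # toy names, as in the answer

panel <- expand.grid(municipality = municipalities, date = dates,
                     stringsAsFactors = FALSE)

# Sort so that each municipality's 166 days sit together
panel <- panel[order(panel$municipality, panel$date), ]
rownames(panel) <- NULL

# Hypothetical real data: only some municipality/date pairs are observed
real_data <- data.frame(
  municipality = c("name1", "name3"),
  date = as.Date(c("2019-10-20", "2020-01-15")),
  cases = c(5, 12)
)

# Left join: keep every row of the panel, fill unobserved pairs with NA
filled <- merge(panel, real_data, by = c("municipality", "date"), all.x = TRUE)
```

With `all.x = TRUE` the panel keeps its full set of rows, and any municipality/date pair missing from the real data shows up as `NA`.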

Related

Dividing one dataframe into many with names in R

I have some large data frames that are big enough to push the limits of R on my machine; e.g., the one on which I'm currently working is 2 columns by 70 million rows. The contents aren't important, but just in case, column 1 is a string and column 2 is an integer.
What I would like to do is split that data frame into n parts (say, 20, but preferably something that could change on a case-by-case basis) so that I can work on each of the smaller data frames one at a time. That means that (a) the result has to produce things that are named (e.g., "newdf_1", "newdf_2", ... "newdf_20" or something), and (b) each line in the original data frame needs to be in one (and only one) of the new "sub" data frames. The order does not matter, but doing it sequentially by rows makes sense to me.
Once I do the work, I will start to recombine them (using rbind()) one pair at a time.
I've looked at split(), but from what I can tell, it is designed to work with factors (which I don't have).
Any ideas?
You can create a new column and split the data frame based on that column. The column does not need to be a factor, but it needs to be a data type that the split function can convert to a factor.
# Number of groups
N <- 20
dat$group <- 1:nrow(dat) %% N
# Add 1 to group
dat$group <- dat$group + 1
# Split the dat by group
dat_list <- split(dat, f = ~group)
# Set the name of the list
names(dat_list) <- paste0("newdf_", 1:N)
Data
set.seed(123)
# Create example data frame
dat <- data.frame(
A = sample(letters, size = 70000000, replace = TRUE),
B = rpois(70000000, lambda = 1)
)
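The modulo grouping above deals rows out round-robin. Since the question mentions that splitting sequentially by rows would also be fine, here is a minimal sketch of that variant on a much smaller toy data frame (1,000 rows rather than 70 million, an assumption made only to keep the example quick). It also shows one way to recombine the pieces with rbind().

```r
set.seed(123)
# Small stand-in for the real data
dat <- data.frame(
  A = sample(letters, size = 1000, replace = TRUE),
  B = rpois(1000, lambda = 1)
)

# Number of groups
N <- 20

# Sequential group index: rows 1-50 go to group 1, rows 51-100 to group 2, ...
# (assumes nrow(dat) >= N so that every group gets at least one row)
group <- ceiling(seq_len(nrow(dat)) * N / nrow(dat))

# Split without adding a column to dat, then name the pieces
dat_list <- split(dat, group)
names(dat_list) <- paste0("newdf_", seq_len(N))

# ... work on dat_list[["newdf_1"]], dat_list[["newdf_2"]], etc. ...

# Recombine; because the split was sequential, the original row order is kept
recombined <- do.call(rbind, unname(dat_list))
```

Keeping the pieces in a named list rather than as 20 separate objects in the workspace makes it easier to loop over them and to change N from case to case.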
Here's a tidyverse based solution. Try using read_csv_chunked().
library(tidyverse)
# practice data
tibble(string = sample(letters, 1e6, replace = TRUE),
value = rnorm(1e6)) %>%
write_csv("test.csv")
# here's the solution
partial_data <- read_csv_chunked("test.csv",
DataFrameCallback$new(function(x, pos) filter(x, string == "a")),
chunk_size = 1000)
You can wrap the call to read_csv_chunked in a function where you change the string that you subset on.
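One possible shape for such a wrapper is sketched below. The helper name `read_matching` and its arguments are hypothetical, the column names `string` and `value` follow the practice data, and the file is written to a temporary path purely for illustration.

```r
library(readr)
library(dplyr)

# Hypothetical helper: keep only the rows whose `string` column equals `target`,
# reading the file chunk by chunk and row-binding the filtered chunks.
read_matching <- function(path, target, chunk_size = 1000) {
  read_csv_chunked(
    path,
    callback = DataFrameCallback$new(
      function(x, pos) filter(x, string == target)
    ),
    chunk_size = chunk_size,
    # Fixed column types so every chunk is parsed the same way
    col_types = cols(string = col_character(), value = col_double())
  )
}

# Small practice file written to a temporary location
set.seed(1)
practice <- tibble(string = sample(letters, 5000, replace = TRUE),
                   value = rnorm(5000))
csv_path <- tempfile(fileext = ".csv")
write_csv(practice, csv_path)

# Same file, different subsets
only_a <- read_matching(csv_path, "a")
only_b <- read_matching(csv_path, "b")
```

Each call scans the whole file once, so this suits cases where only a few subsets are needed at a time.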
This is more or less a repeat of this question:
How to read only lines that fulfil a condition from a csv into R?

For Loop Across Specific Column Range in R

I have a wide data frame consisting of 1000 rows and over 300 columns. The first 2 columns are GroupID and Categorical fields. The remaining columns are all continuous numeric measurements. What I would like to do is loop through a specific range of these columns in R, beginning with the first numeric column (column #3). For example, loop through columns 3:10. I would also like to retain the column names in the loop. I've started with the following code:
for(i in 3:ncol(df)){
print(i)
}
But this includes all columns to the right of column #3 (not the range 3:10), and this does not identify column names. Can anyone help get me started on this loop so I can specify the column range and also retain column names? TIA!
Side Note: I've used tidyr to gather the data frame in long format. That works, but I've found it makes my data frame very large and therefore eats a lot of time and memory in my loop.
Since you did not include your data, I created similar dummy data (1000 rows and 302 columns, 2 of them ID variables) to show you how to select columns and prepare them for plotting:
library(reshape2)
library(ggplot2)
set.seed(123)
#Dummy data
Numvars <- as.data.frame(matrix(rnorm(1000*300),nrow = 1000,ncol = 300))
vec1 <- 1:1000
vec2 <- rep(paste0('class',1:5),200)
IDs <- data.frame(vec1,vec2,stringsAsFactors = F)
#Bind data
Data <- cbind(IDs,Numvars)
#Select vars (in your case 10 initial vars)
df <- Data[,1:12]
#Prepare for plot
df.melted <- melt(data = df,id.vars = c('vec1','vec2'))
#Plot
ggplot(df.melted,aes(x=vec1,y=value,group=variable,color=variable))+
geom_line()+
facet_wrap(~vec2)
You will end up with a plot like this:
I hope this helps.
You can keep column names by feeding them into an lapply function, here's an example with the iris dataset:
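library(ggplot2)
lapply(names(iris)[2:4], function(columntoplot){
# one-column data frame holding the values of the current column
df <- data.frame(datatoplot=iris[[columntoplot]])
graphname <- columntoplot
p <- ggplot(df, aes(x = datatoplot)) +
geom_histogram() +
ggtitle(graphname)
# pass the plot explicitly rather than relying on the last plot created
ggsave(filename = paste0(graphname, ".png"), plot = p, width = 4, height = 4)
})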
In the lapply function, you create a new dataset comprising one column (note the double brackets). You can then plot and optionally save the output within the function (see ggsave line). You're then able to use the column name as the plot title as well as the file name.
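The same idea works with a plain for loop over a restricted column range, which is closer to what the question asked for. This is a minimal sketch on a small made-up data frame (5 rows, 10 numeric columns); the per-column mean is only a placeholder for whatever the loop body should do.

```r
set.seed(1)
# Toy wide data frame: 2 ID columns followed by 10 numeric columns (X1..X10)
df <- data.frame(
  GroupID = 1:5,
  Category = letters[1:5],
  matrix(rnorm(5 * 10), nrow = 5)
)

# Loop over the column names in positions 3:10 instead of over the positions
col_range <- 3:10
col_means <- numeric(0)
for (col_name in names(df)[col_range]) {
  # col_name is available for labels, file names, etc.
  col_means[col_name] <- mean(df[[col_name]])
}

col_means
```

Iterating over `names(df)[3:10]` restricts the loop to the chosen range and keeps the column name at hand in each iteration, without reshaping the data to long format.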

R: How to Count Rows with Subsetted Date in Date Formatted Column

I have about 30,000 rows of data with a Date column in date format. I would like to be able to count the number of rows by month/year and year, but when I aggregate with the below code, I get a vector within the data table for my results instead of a number.
Using the hyperlinked csv file, I have tried the aggregate function.
https://www.dropbox.com/s/a26t1gvbqaznjy0/myfiles.csv?dl=0
short.date <- strftime(myfiles$Date, "%Y/%m")
aggr.stat <- aggregate(myfiles$Date ~ short.date, FUN = count)
Below is a view of the aggr.stat data frame. There are two columns and the second one beginning with "c(" is the one where I'd like to see a count value.
1 1969/01 c(-365, -358, -351, -347, -346)
2 1969/02 c(-323, -320)
3 1969/03 c(-306, -292, -290)
4 1969/04 c(-275, -272, -271, -269, -261, -255)
5 1969/05 c(-245, -240, -231)
6 1969/06 c(-214, -211, -210, -205, -204, -201, -200, -194, -190, -186)
I'm not much into downloading any unknown file from the internet, so you'll have to adapt my proposed solution to your needs.
You can solve the problem with the help of data.table and lubridate.
Imagine your data has at least one column called dates containing actual dates (that is, calling class(df$dates) returns Date or something similar, such as POSIXct).
# load libraries
library(data.table)
library(lubridate)
# convert df to a data.table
setDT(df)
# count rows per month
df[, .N, by = .(monthDate = floor_date(dates, "month"))]
.N counts the number of rows, by = groups the data. See ?data.table for further details.
Consider running everything from data frames. Specifically, add the needed month/year column to the data frame and then run aggregate with the data argument (instead of passing separate vectors). Finally, there is no count() function in base R; use length() instead:
# NEW COLUMN
myfiles$short.date <- strftime(myfiles$Date, "%Y/%m")
# AGGREGATE WITH SPECIFIED DATA
aggr.stat <- aggregate(Date ~ short.date, data = myfiles, FUN = length)
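To see the effect without the original file, here is a small sketch on made-up dates (the four values are assumptions for illustration only). It counts rows per month/year and, with an extra hypothetical `year` column, per year.

```r
# Made-up stand-in for the real myfiles data
myfiles <- data.frame(
  Date = as.Date(c("1969-01-01", "1969-01-08", "1969-02-12", "1970-02-01"))
)

# Grouping columns derived from the date
myfiles$short.date <- format(myfiles$Date, "%Y/%m")
myfiles$year <- format(myfiles$Date, "%Y")

# length() returns one number per group, so the Date column holds the counts
by_month <- aggregate(Date ~ short.date, data = myfiles, FUN = length)
by_year <- aggregate(Date ~ year, data = myfiles, FUN = length)

by_month
by_year
```

For simple counts, `table(myfiles$short.date)` is a shorter base R alternative.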

How does zoo() fill n rows while the original Dataframe has n-1 row?

library(dplyr)
library(zoo)
I have a dataframe with 114 rows :
df = data.frame(a= (seq(from = as.Date("2016-11-27"), to = as.Date("2019-01-27"), by = 7)), b=seq(0:5)) #Create a dataframe
colnames(df) <- c("time","value") # change col names
Here however we will remove the 4 first rows of the dataframe
neodf <- (slice(df, 5:nrow(df)))
colnames(neodf) <- c("time","value") # change col names
We create a zoo time series with the same sequence as the original dataframe but with the values of the new dataframe
ts <- zoo(neodf$value, seq(from = as.Date("2016-11-27"), to = as.Date("2019-01-27"), by = 7))
We can see that the zoo object does indeed have more rows than neodf, so I was wondering whether the zoo object automatically shifts values and creates values at the end, or vice versa?
My original problem is with some sales time series. The original dataframe has 4 observations every month, but in one year we only have 1 observation in December. As you can imagine, since I'm using a zoo() object for the transformation (by using the sequence option), I end up with 4 observations in December, and they actually contain values!
Thanks !

How to subtract values by comparing columns from two datasets?

I have the following data structure:
pos.c1<-seq(from=1,to=100,by=1)
map.c1<-seq(from=0,to=1,length.out = 100)
cro.c1<-rep(1,100)
pos.c2<-seq(from=1,to=80,by=1)
map.c2<-seq(from=0,to=1,length.out = 80)
cro.c2<-rep(2,80)
c1<-cbind(cro.c1,pos.c1,map.c1)
c2<-cbind(cro.c2,pos.c2,map.c2)
map<-rbind(c1,c2)
colnames(map)<-c("Chr","Pos","CM")
Pos.1<-c(30,52,60,72,80,4,12,30,40)
Pos.2<-c(40,53,71,79,95,9,20,35,79)
Chr<-c(rep(1,5),rep(2,4))
Data<-cbind(Chr,Pos.1,Pos.2)
Two dataframes.
map: with three variables. Chr, Pos and CM.
Data: with three variables: Chr, Pos.1, Pos.2
Matching Data$Pos.2 and Data$Pos.1 with map$Pos, I need to get the difference of map$CM values between these two matches. This procedure needs to be done by $Chr.
As an example: for the first row of Data (1,30,40) the desired value would be 0.1010101 (obtained by 0.39393939 - 0.29292929). For the first row of Data with Chr = 2 (2,4,9) the desired value would be 0.06329114 (0.10126582 - 0.03797468).
If I understood correctly what you want, I think you have to do something like this:
Using library tidyverse
library(tidyverse)
You have to transform your data into data frames:
Data <- as.data.frame(Data)
map <- as.data.frame(map)
Then you just have to retrieve the CM values using left_join:
Data_CM <- left_join(Data,map,by=c("Chr","Pos.1"="Pos")) %>%
rename(CM.1=CM)
Data_CM <- left_join(Data_CM,map,by=c("Chr","Pos.2"="Pos")) %>%
rename(CM.2=CM)
The Diff variable holds the difference between the two retrieved values:
Data_CM <- Data_CM %>%
mutate(Diff=(CM.2-CM.1))
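For comparison, the same lookup can be sketched in base R with match() on a combined Chr/Pos key. This is an alternative to the left_join approach rather than part of it; the data are rebuilt directly as data frames from the definitions in the question, and the key vectors `map_key`, `cm_1` and `cm_2` are names introduced here for illustration.

```r
# Map: chromosome 1 has 100 positions, chromosome 2 has 80
map <- data.frame(
  Chr = c(rep(1, 100), rep(2, 80)),
  Pos = c(1:100, 1:80),
  CM = c(seq(from = 0, to = 1, length.out = 100),
         seq(from = 0, to = 1, length.out = 80))
)

Data <- data.frame(
  Chr = c(rep(1, 5), rep(2, 4)),
  Pos.1 = c(30, 52, 60, 72, 80, 4, 12, 30, 40),
  Pos.2 = c(40, 53, 71, 79, 95, 9, 20, 35, 79)
)

# Combined key so that positions are only matched within the same chromosome
map_key <- paste(map$Chr, map$Pos)
cm_1 <- map$CM[match(paste(Data$Chr, Data$Pos.1), map_key)]
cm_2 <- map$CM[match(paste(Data$Chr, Data$Pos.2), map_key)]

# Difference in CM between the two matched positions
# (first row: 39/99 - 29/99 = 10/99)
Data$Diff <- cm_2 - cm_1
```

Any position in Data that is absent from map would yield NA in Diff, which makes unmatched rows easy to spot.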
