R - Using str_split and unlist to create two columns

I have a dataset that has dates and interest rates in the same column. I need to split these into two separate columns; however, when I use the following code:
library(stringr)
Split <- str_split(df$Dates, "[ ]", n = 2)
Dates <- unlist(Split)[1]
Rates <- unlist(Split)[2]
It returns only the first value of each, i.e., "1971-04-01" for Dates and "7.43" for Rates. I need Dates to contain the first portion of every split string and Rates to contain the second portion.
Below is a portion of the dataset, total rows = 518.
1971-04-01 7.31
1971-05-01 7.43
1971-06-01 7.53
1971-07-01 7.60
1971-08-01 7.70
1971-09-01 7.69
1971-10-01 7.63
1971-11-01 7.55
1971-12-01 7.48
1972-01-01 7.44
Thanks

You could do:
Split <- strsplit(as.character(df$Dates), " ", fixed = TRUE)
Dates <- sapply(Split, "[", 1)
Rates <- sapply(Split, "[", 2)
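If you then want these two character vectors back in a single data frame with appropriate types, something along these lines should work (a sketch; as.Date assumes the dates stay in the yyyy-mm-dd form shown above):
# combine the split pieces into a data frame with proper types
result <- data.frame(Dates = as.Date(Dates),
                     Rates = as.numeric(Rates))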

You can use reshape2::colsplit
library(reshape2)
colsplit(df$Dates, ' ', names = c('Dates','Rates'))
# Dates Rates
# 1 1971-04-01 7.31
# 2 1971-05-01 7.43
# 3 1971-06-01 7.53
# 4 1971-07-01 7.60
# 5 1971-08-01 7.70
# 6 1971-09-01 7.69
# 7 1971-10-01 7.63
# 8 1971-11-01 7.55
# 9 1971-12-01 7.48
# 10 1972-01-01 7.44

Perhaps I'm biased, but I would suggest my cSplit function for this problem.
First, I'm assuming we are starting with the following (single-column) data.frame, where there are one or more spaces between the "date" value and the "rate" value.
df <- data.frame(
Date = c("1971-04-01 7.31", "1971-05-01 7.43", "1971-06-01 7.53",
"1971-07-01 7.60", "1971-08-01 7.70", "1971-09-01 7.69",
"1971-10-01 7.63", "1971-11-01 7.55", "1971-12-01 7.48",
"1972-01-01 7.44"))
Next, get the cSplit function from my GitHub Gist, and use it. You can split on a regular expression (here, multiple spaces).
cSplit(df, "Date", "\\s+", fixed = FALSE)
# Date_1 Date_2
# 1: 1971-04-01 7.31
# 2: 1971-05-01 7.43
# 3: 1971-06-01 7.53
# 4: 1971-07-01 7.60
# 5: 1971-08-01 7.70
# 6: 1971-09-01 7.69
# 7: 1971-10-01 7.63
# 8: 1971-11-01 7.55
# 9: 1971-12-01 7.48
# 10: 1972-01-01 7.44
Since the function converts a data.frame to a data.table, you have access to setnames which would let you rename your columns in place.
setnames(cSplit(df, "Date", "\\s+", fixed = FALSE), c("Dates", "Rates"))[]
# Dates Rates
# 1: 1971-04-01 7.31
# 2: 1971-05-01 7.43
# 3: 1971-06-01 7.53
# 4: 1971-07-01 7.60
# 5: 1971-08-01 7.70
# 6: 1971-09-01 7.69
# 7: 1971-10-01 7.63
# 8: 1971-11-01 7.55
# 9: 1971-12-01 7.48
# 10: 1972-01-01 7.44

Using @user2583119's data (please post minimal reproducible code including a data set):
library(qdap)
colsplit2df(data.frame(Split), sep = " ")
## X1 X2
## 1 1971-06-01 7.53
## 2 1971-05-01 7.43
## 3 1971-06-01 7.53

Also:
Split <- c("1971-06-01 7.53", "1971-05-01 7.43", "1971-06-01 7.53")
Your code selects only the first observation.
Str <- unlist(str_split(Split, "[ ]", n=2))
Str[1]
#[1] "1971-06-01"
If you look at the output of unlist(...), dates are followed by values, so you can use a logical index (which recycles along the vector).
Str[c(T,F)]
#[1] "1971-06-01" "1971-05-01" "1971-06-01"
as.numeric(Str[c(F,T)])
#[1] 7.53 7.43 7.53
You can convert Split into a two-column data frame by using read.table:
read.table(text=Split, header=F, sep="",stringsAsFactors=F)
# V1 V2
# 1 1971-06-01 7.53
# 2 1971-05-01 7.43
# 3 1971-06-01 7.53

df <- data.frame(
Date = c("1971-04-01 7.31", "1971-05-01 7.43", "1971-06-01 7.53",
"1971-07-01 7.60", "1971-08-01 7.70", "1971-09-01 7.69",
"1971-10-01 7.63", "1971-11-01 7.55", "1971-12-01 7.48",
"1972-01-01 7.44"))
do.call(rbind, strsplit(as.character(df$Date), split = '\\s+', fixed = FALSE))

Try this:
Split <- c("1971-06-01 7.53", "1971-05-01 7.43", "1971-06-01 7.53")
df <- unlist(str_split(string = Split, pattern = "\\s"))
df
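As written, this gives a single flat character vector with dates and rates interleaved. A small follow-up sketch (assuming every element splits into exactly two fields) arranges it into two columns:
# fold the interleaved vector into a two-column matrix, then a data frame
m <- matrix(df, ncol = 2, byrow = TRUE)
out <- data.frame(Dates = m[, 1], Rates = as.numeric(m[, 2]))
out
#        Dates Rates
# 1 1971-06-01  7.53
# 2 1971-05-01  7.43
# 3 1971-06-01  7.53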


do.call() isn't appending a list of dataframes correctly. Any idea why?

I'm downloading a bunch of PiT datasets and trying to automate their combination into a single time series dataframe (master_df).
library(stringr)   # str_detect, str_replace
library(magrittr)  # %>%
temp <- tempfile()
testing <- download.file("https://data.sa.gov.au/data/dataset/3ba1c4dd-e52f-4c28-858a-21284c3ee458/resource/c78fc6da-baa4-47cc-b4df-a97f452bbf9a/download/ken01_p.zip",temp)
filenames<-unzip(temp,list=TRUE)[,1]
#only want csvs
filenames<-filenames[str_detect(filenames,".csv")]
dfnames = list()
for (i in 1:length(filenames)) {
  conn <- unz(temp, filenames[i])
  # name files in loop
  filename <- sprintf("df_%s", filenames[i] %>%
    str_replace("KEN01_p/KEN01p_1hr", "") %>%
    str_replace(".csv", ""))
  # list of filenames
  dfnames[[i]] <- filename
  assign(filename, read.csv(conn))
}
master_df <- do.call(rbind, dfnames)
unlink(temp)
class(master_df)
[1] "matrix"
class(df_201912)
[1] "data.frame"
The loop is successfully reading all the datasets and naming them df_yyyymm, but do.call(rbind, ...) is just producing a matrix of the dataset names rather than the combined data.
What am I doing wrong?
Thanks!!
There is no need to use assign, since it writes all the dataframes to the global environment, which is not required. You can combine all the dataframes into one with lapply; also, some of the dataframes have different column names, so it may be better to use map_df, which will combine them into one dataframe anyway, filling the missing columns with NA values.
purrr::map_df(filenames, function(x) {
read.csv(unz(temp, x))
}) -> master_df
master_df
The issue in the code is that the list element is assigned the filename string instead of the data itself:
for (i in 1:length(filenames)) {
  conn <- unz(temp, filenames[i])
  # name files in loop
  filename <- sprintf("df_%s", filenames[i] %>%
    str_replace("KEN01_p/KEN01p_1hr", "") %>%
    str_replace(".csv", ""))
  # list of filenames
  dfnames[[i]] <- read.csv(conn)   ### assign the data, not the filename
  # assign(filename, read.csv(conn))
}
Also, some list elements have different column names, so rbind wouldn't work; we can use rbindlist from data.table instead:
library(data.table)
out <- rbindlist(dfnames, fill = TRUE)
dim(out)
[1] 44544 6
This is what I would do to download a zip file, unpack it and read all csv files into one large dataset:
temp <- tempfile()
testing <- download.file(
"https://data.sa.gov.au/data/dataset/3ba1c4dd-e52f-4c28-858a-21284c3ee458/resource/c78fc6da-baa4-47cc-b4df-a97f452bbf9a/download/ken01_p.zip",
temp)
filenames <- unzip(temp, list = FALSE)
library(data.table)
library(magrittr) # piping used to improve readability
master_df <- lapply(filenames, fread) %>%
set_names(filenames %>% basename() %>% stringr::str_remove_all("^KEN01p_1hr|\\.csv$")) %>%
rbindlist(fill = TRUE, idcol = TRUE)
master_df
.id Date/Time PM10 TEOM ug/m3 PM2.5 TEOM ug/m3 Temperature Deg C Barometric Pressure atm PM10 BAM ug/m3
1: 201501 1/01/2015 1:00 18.2 NA 16.8 0.986 NA
2: 201501 1/01/2015 2:00 20.3 NA 15.9 0.985 NA
3: 201501 1/01/2015 3:00 27.9 NA 15.1 0.985 NA
4: 201501 1/01/2015 4:00 23.6 NA 16.9 0.984 NA
5: 201501 1/01/2015 5:00 15.8 NA 19.7 0.984 NA
---
44540: 201912 31/12/2019 20:00 NA NA 19.4 NA 14
44541: 201912 31/12/2019 21:00 NA NA 18.0 NA 14
44542: 201912 31/12/2019 22:00 NA NA 16.7 NA 19
44543: 201912 31/12/2019 23:00 NA NA 15.8 NA 11
44544: 201912 1/01/2020 0:00 NA NA 15.3 NA 12
Note that I have changed the list parameter in
filenames <- unzip(temp, list = FALSE)
to FALSE. This unpacks the zip file into a subdirectory named KEN01_p. After unpacking, the subdirectory contains 61 csv files with 1.5 MBytes in total.
Also note that the .id column in master_df indicates the source file of each row.
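If a proper date is needed later, the year-month id can be parsed into one, for example (a sketch, assuming the ids keep the yyyymm form shown above; month_start is just an illustrative column name):
# turn the yyyymm id into a Date marking the first day of that month
master_df$month_start <- as.Date(paste0(master_df$.id, "01"), format = "%Y%m%d")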

Splitting a part of the file name and transforming it into data in R

I have some files in the format data_25_05_2018.csv, each of which has 4 columns with 30 values.
I would like to add a column with the same date to each of them, so the idea is to tell R to take the name of the file, split it, keep only the 25_05_2018 part, transform it into a valid date format, and create a column from it.
Is there a way to transform part of a file name into data in R?
You could use a regular expression and the dmy() function from lubridate to do this:
library(lubridate)
library(tibble)
## make some fake data
DF <- matrix(rnorm(120), ncol=4)
colnames(DF) <- c("V1", "V2", "V3", "V4")
## turn data into a tibble
DF <- as_tibble(DF)
## make file name
x <- "data_25_05_2018.csv"
## extract everything between data_ and .csv
x <- gsub("data_(.*)\\.csv", "\\1", x)
## use dmy to turn it into a date and add it to the data frame
DF$date <- dmy(x)
> head(DF)
# # A tibble: 6 x 5
# V1 V2 V3 V4 date
# <dbl> <dbl> <dbl> <dbl> <date>
# 1 0.692 -1.51 1.74 -0.585 2018-05-25
# 2 1.08 -0.812 -1.55 -1.98 2018-05-25
# 3 0.000229 2.55 -0.577 -0.619 2018-05-25
# 4 0.940 -0.906 0.990 -1.48 2018-05-25
# 5 -1.78 0.815 0.436 -0.125 2018-05-25
# 6 -0.324 0.735 0.974 0.151 2018-05-25

How to compare with a "reference date", then fill missing data in R?

First, let me show the example data (Data A & B):
(1) Data A:
Date_Collected A_Value
01/04/2016 10:53 0.137
01/20/2016 13:13 0.204
01/25/2016 11:09 0.199
02/01/2016 12:55 0.441
02/01/2016 12:56 0.215
02/01/2016 13:11 0.397
02/03/2016 09:19 0.377
02/10/2016 08:11 1.45
02/15/2016 13:04 2.63
(2) Data B:
Date_Collected B_Value
01/04/2016 10:53 0.108
01/20/2016 13:13 0.404
02/01/2016 13:11 0.594
02/15/2016 13:04 1.99
Second, let me explain what I want to do with R. You can see that "Data A" has 9 records, while "Data B" has only 4. As these values are precious, I do not want to delete rows of "Data A" to fit "Data B"; instead, I want to fill in the "missing" data in "Data B". What needs to be done can be separated into two parts:
(Part Ⅰ)
① Add blank rows to "Data B" at the corresponding locations, according to "Data A"; ② in these blank rows, copy over the corresponding dates, so that at the end of Part Ⅰ "Data B" has the same dates as "Data A", with missing values for B_Value.
(Part Ⅱ)
Interpolate the missing data in "B_Value". This part has already been solved; you can see the solution in another Stack Overflow question.
Could someone give me advice about it (especially Part Ⅰ)? Thanks.
library(tidyverse)
# example data
dt_A = data.frame(Date = c("01/04/2016 10:53", "02/04/2016 10:54", "03/04/2016 10:55"),
A_Value = c(5,6,7))
dt_B = data.frame(Date = c("01/04/2016 10:53", "03/04/2016 10:55"),
B_Value = c(1,3))
# complete dates of data B using dates of data A
dt_B %>% complete(Date = dt_A$Date)
# # A tibble: 3 x 2
# Date B_Value
# <chr> <dbl>
# 1 01/04/2016 10:53 1
# 2 02/04/2016 10:54 NA
# 3 03/04/2016 10:55 3
Using merge:
# data stolen from #AntoniosK's post
dt_A = data.frame(Date = c("01/04/2016 10:53", "02/04/2016 10:54", "03/04/2016 10:55"),
A_Value = c(5,6,7))
dt_B = data.frame(Date = c("01/04/2016 10:53", "03/04/2016 10:55"),
B_Value = c(1,3))
# keep dates as date
dt_A$Date <- as.POSIXct(dt_A$Date, format="%m/%d/%Y %H:%M")
dt_B$Date <- as.POSIXct(dt_B$Date, format="%m/%d/%Y %H:%M")
# then merge and sort on date
res <- merge(dt_B, dt_A[, "Date", drop = FALSE], all.y = TRUE)
res <- res[ order(res$Date), ]
res
# Date B_Value
# 1 2016-01-04 10:53:00 1
# 2 2016-02-04 10:54:00 NA
# 3 2016-03-04 10:55:00 3
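If you still need the interpolation from Part Ⅱ afterwards, here is a minimal sketch of that step using zoo::na.approx (one common approach, not necessarily the solution linked in the question):
# linearly interpolate the missing B_Value entries, using the timestamps as the x-axis
library(zoo)
res$B_Value <- na.approx(res$B_Value, x = as.numeric(res$Date), na.rm = FALSE)
# the NA for 2016-02-04 is replaced by a value interpolated between 1 and 3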

Create several new derived variables from existing variables in data.frame

In R I have a data.frame with several variables that have been measured monthly over several years. I would like to derive the monthly average (using all years) for each variable. Ideally these new variables would all be together in a new data.frame (carrying over the ID); below I am simply adding the new variables to the existing data.frame. The only way I know how to do this at the moment (below) seems quite laborious, and I was hoping there might be a smarter way to do it in R that would not require typing out each month and variable as I did below.
# Example data.frame with only two years, two month, and two variables
# In the real data set there are always 12 months per year
# and there are at least four variables
df<- structure(list(ID = 1:4, ABC.M1Y2001 = c(10, 12.3, 45, 89), ABC.M2Y2001 = c(11.1,
34, 67.7, -15.6), ABC.M1Y2002 = c(-11.1, 9, 34, 56.5), ABC.M2Y2002 = c(12L,
13L, 11L, 21L), DEF.M1Y2001 = c(14L, 14L, 14L, 16L), DEF.M2Y2001 = c(15L,
15L, 15L, 12L), DEF.M1Y2002 = c(5, 12, 23.5, 34), DEF.M2Y2002 = c(6L,
34L, 61L, 56L)), .Names = c("ID", "ABC.M1Y2001", "ABC.M2Y2001","ABC.M1Y2002",
"ABC.M2Y2002", "DEF.M1Y2001", "DEF.M2Y2001", "DEF.M1Y2002",
"DEF.M2Y2002"), class = "data.frame", row.names = c(NA, -4L))
# list variable to average for ABC Month 1 across years
ABC.M1.names <- c("ABC.M1Y2001", "ABC.M1Y2002")
df <- transform(df, ABC.M1 = rowMeans(df[,ABC.M1.names], na.rm = TRUE))
# list variable to average for ABC Month 2 across years
ABC.M2.names <- c("ABC.M2Y2001", "ABC.M2Y2002")
df <- transform(df, ABC.M2 = rowMeans(df[,ABC.M2.names], na.rm = TRUE))
# and so forth for ABC
# ...
# list variables to average for DEF Month 1 across years
DEF.M1.names <- c("DEF.M1Y2001", "DEF.M1Y2002")
df <- transform(df, DEF.M1 = rowMeans(df[,DEF.M1.names], na.rm = TRUE))
# and so forth for DEF
# ...
Here's a solution using data.table development version v1.8.11 (which has melt and cast methods implemented for data.table):
require(data.table)
require(reshape2) # melt/cast builds on S3 generic from reshape2
dt <- data.table(df) # where df is your data.frame
dcast.data.table(
  melt(dt, id = "ID")[, sum(value)/.N, list(ID, gsub("Y.*$", "", variable))],
  ID ~ gsub)
ID ABC.M1 ABC.M2 DEF.M1 DEF.M2
1: 1 -0.55 11.55 9.50 10.5
2: 2 10.65 23.50 13.00 24.5
3: 3 39.50 39.35 18.75 38.0
4: 4 72.75 2.70 25.00 34.0
You can just cbind this to your original data.
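A minimal sketch of that step, using dt and df from above (monthly_means is just an illustrative name for the reshaped result, and the rows of df are assumed to be ordered by ID, as in the example):
# store the reshaped summary, then bind its value columns onto the original data,
# dropping the duplicate ID column
monthly_means <- dcast.data.table(
  melt(dt, id = "ID")[, sum(value)/.N, list(ID, gsub("Y.*$", "", variable))],
  ID ~ gsub)
df_new <- cbind(df, as.data.frame(monthly_means)[, -1])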
Note that sum is a primitive, whereas mean is an S3 generic. Therefore, using sum(.)/length(.) is better: if there are many groupings, dispatching the right method for mean in every group can be quite time-consuming. .N is a special variable in data.table that directly gives you the length of the group.
Here is a solution using reshape2 that is more automated when you have lots of data and uses regular expressions to extract the variable name and the month. This solution will give you a nice summary table.
# Load required package
require(reshape2)
# Melt your wide data into long format
mdf <- melt(df , id = "ID" )
# Extract relevant variable names from the variable column
mdf$Month <- gsub( "^.*\\.(M[0-9]{1,2}).*$" , "\\1" , mdf$variable )
mdf$Var <- gsub( "^(.*)\\..*" , "\\1" , mdf$variable )
# Aggregate by month and variable
dcast( mdf , Var ~ Month , mean )
# Var M1 M2
#1 ABC 30.5875 19.275
#2 DEF 16.5625 26.750
Or to be compatible with the other solutions, and return the table by ID as well...
dcast( mdf , ID ~ Var + Month , mean )
# ID ABC_M1 ABC_M2 DEF_M1 DEF_M2
#1 1 -0.55 11.55 9.50 10.5
#2 2 10.65 23.50 13.00 24.5
#3 3 39.50 39.35 18.75 38.0
#4 4 72.75 2.70 25.00 34.0
This is pretty straightforward in base R.
mean.names <- split(names(df)[-1], gsub('Y[0-9]{4}$', '', names(df)[-1]))
means <- lapply(mean.names, function(x) rowMeans(df[, x], na.rm = TRUE))
data.frame(df, means)
This gives you your original data.frame with the following four columns at the end:
ABC.M1 ABC.M2 DEF.M1 DEF.M2
1 -0.55 11.55 9.50 10.5
2 10.65 23.50 13.00 24.5
3 39.50 39.35 18.75 38.0
4 72.75 2.70 25.00 34.0
You can use Reshape from the splitstackshape package and then use the plyr package, data.table, or base R to compute the means.
library(splitstackshape) # Reshape
library(plyr) # ddply
kk<-Reshape(df,id.vars="ID",var.stubs=c("ABC.M1","ABC.M2","DEF.M1","DEF.M2"),sep="")
> kk
ID AE DB time ABC.M1 ABC.M2 DEF.M1 DEF.M2
1 1 NA NA 1 10.0 11.1 14.0 15
2 2 NA NA 1 12.3 34.0 14.0 15
3 3 NA NA 1 45.0 67.7 14.0 15
4 4 NA NA 1 89.0 -15.6 16.0 12
5 1 NA NA 2 -11.1 12.0 5.0 6
6 2 NA NA 2 9.0 13.0 12.0 34
7 3 NA NA 2 34.0 11.0 23.5 61
8 4 NA NA 2 56.5 21.0 34.0 56
ddply(kk[,c(1,5:8)],.(ID),colwise(mean))
ID ABC.M1 ABC.M2 DEF.M1 DEF.M2
1 1 -0.55 11.55 9.50 10.5
2 2 10.65 23.50 13.00 24.5
3 3 39.50 39.35 18.75 38.0
4 4 72.75 2.70 25.00 34.0
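For completeness, the data.table route mentioned above could look something like this (a sketch, operating on kk from the Reshape step):
# convert kk to a data.table and take per-ID means of the value columns
library(data.table)
setDT(kk)[, lapply(.SD, mean), by = ID,
          .SDcols = c("ABC.M1", "ABC.M2", "DEF.M1", "DEF.M2")]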

Melt and Recast into a new dataframe in r

I just downloaded a lot of temperature data from one of our dataloggers. The dataframe gives me mean hourly temperature observations for 1691 hours from 87 temperature sensors (so there is a lot of data here). It looks something like this:
D1_A D1_B D1_C
13.43 14.39 12.33
12.62 13.53 11.56
11.67 12.56 10.36
10.83 11.62 9.47
I would like to reshape this dataset into a matrix that looks like this:
#create a blank matrix 5 columns 131898 rows
matrix1<-matrix(nrow=131898, ncol=5)
colnames(matrix1)<- c("year", "ID", "Soil_Layer", "Hour", "Temperature")
where:
year is always "2012"
ID corresponds to the header ID (e.g. D1)
Soil_Layer corresponds to the second bit of the header (e.g. A, B, or C)
Hour = 1:1691 for each sensor
and Temperature = the observed values in the original dataframe.
Can this be done with the reshape package in R? Does it need to be done as a loop? Any input on how to handle this dataset would be useful. Cheers!
I think this does what you want...you can take advantage of the colsplit() and melt() functions in package reshape2. It's not clear where you identify the Hour for the data, so I assumed it was ordered from the original dataset. If that's not the case, update your question:
library(reshape2)
#read in your data
x <- read.table(text = "
D1_A D1_B D1_C
13.43 14.39 12.33
12.62 13.53 11.56
11.67 12.56 10.36
10.83 11.62 9.47
9.98 10.77 9.04
9.24 10.06 8.65
8.89 9.55 8.78
9.01 9.39 9.88
", header = TRUE)
#add hour index, if data isn't ordered, replace this with whatever
#tells you which hour goes where
x$hour <- 1:nrow(x)
#Melt into long format
x.m <- melt(x, id.vars = "hour")
#Split into two columns
x.m[, c("ID", "Soil_Layer")] <- colsplit(x.m$variable, "_", c("ID", "Soil_Layer"))
#Add the year
x.m$year <- 2012
#Return the first 6 rows
head(x.m[, c("year", "ID", "Soil_Layer", "hour", "value")])
#----
year ID Soil_Layer hour value
1 2012 D1 A 1 13.43
2 2012 D1 A 2 12.62
3 2012 D1 A 3 11.67
4 2012 D1 A 4 10.83
5 2012 D1 A 5 9.98
6 2012 D1 A 6 9.24
