Split data contained in one column into 3 columns in R - r

I have a dataset containing character values (that are really numbers) that I want to split into 3 different columns. These 3 columns should contain the 3 numbers from the original column.
Data<-data.frame(c("1.50 (1.30 to 1.70)", "1.30 (1.20 to 1.50)"))
colnames(Data)<- "values"
Data
values
1.50 (1.30 to 1.70)
1.30 (1.20 to 1.50)
The result I expect is this:
value1 value2 value3
1.50 1.30 1.70
1.30 1.20 1.50

One way of doing this is to use separate() from the tidyr package. From the documentation: Separate a character column into multiple columns with a regular expression or numeric locations.
Adapting the example from the documentation, allowing for decimal points, and using extra="drop" to drop the discarded pieces without warnings:
Data<-data.frame(c("1.50 (1.30 to 1.70)", "1.30 (1.20 to 1.50)"))
colnames(Data)<- "values"
Data
require(tidyr)
separate(Data, col = values, into = paste0("value", 1:3),
         sep = "[^[:digit:]?\\.]+", extra = "drop")
#output
#  value1 value2 value3
#1   1.50   1.30   1.70
#2   1.30   1.20   1.50

We can also use extract(), specifying the regex pattern used to extract the data.
tidyr::extract(Data, values, paste0("value", 1:3),
               regex = '(\\d+\\.\\d+)\\s\\((\\d+\\.\\d+)\\sto\\s(\\d+\\.\\d+)\\)')
# value1 value2 value3
#1 1.50 1.30 1.70
#2 1.30 1.20 1.50
(\\d+\\.\\d+) matches and captures a decimal value.
\\s matches a whitespace character.
We use three capture groups to extract the values into three separate columns.
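Note that both separate() and extract() return character columns. As a small follow-up, the documented convert argument coerces the new columns on the way out:
tidyr::extract(Data, values, paste0("value", 1:3),
               regex = '(\\d+\\.\\d+)\\s\\((\\d+\\.\\d+)\\sto\\s(\\d+\\.\\d+)\\)',
               convert = TRUE)
# convert = TRUE runs type.convert() on the new columns, so value1:value3
# come back numeric rather than character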

You can try this code:
library(easyr)
x = data.frame(c("1.50 (1.30 to 1.70)", "1.30 (1.20 to 1.50)"))
colnames(x)[1] = "val"
x$val1 = left(x$val, 4)
x$val2 = mid(x$val, 7,4)
x$val3 = mid(x$val, 15,4)
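One caveat: left() and mid() pick out characters by fixed position, so this only works while every value has exactly the same width, and the extracted pieces are still character. A small follow-up sketch to coerce them:
# the substrings are still character; coerce the three new columns to numeric
x[c("val1", "val2", "val3")] <- lapply(x[c("val1", "val2", "val3")], as.numeric)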

Related

Find columns with different values in duplicate rows

I have a data set that has some duplicate records. For those records, most of the column values are the same, but a few are different.
I need to identify the columns where the values are different, and then subset those columns.
This would be a sample of my dataset:
library(data.table)
dat <- "ID location date status observationID observationRep observationVal latitude longitude setSource
FJX8KL loc1 2018-11-17 open 445 1 17.6 -52.7 -48.2 XF47
FJX8KL loc1 2018-11-17 open 445 2 1.9 -52.7 -48.2 LT12"
dat <- setDT(read.table(textConnection(dat), header=T))
And this is the output I would expect:
observationRep observationVal setSource
1: 1 17.6 XF47
2: 2 1.9 LT12
One detail is: my original dataset has 189 columns, so I need to check all of them.
How to achieve this?
Two issues: first, use the text= argument rather than textConnection; second, use as.data.table, since setDT modifies an object in place, but here the object isn't there yet.
dat1 <- data.table::as.data.table(read.table(text=dat, header=TRUE))
dat1[, c('observationRep', 'observationVal', 'setSource')]
# observationRep observationVal setSource
# 1: 1 17.6 XF47
# 2: 2 1.9 LT12
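Since the real dataset has 189 columns, hard-coding the names does not scale. A hedged sketch (not part of the original answer) keeps every column with more than one distinct value among the duplicate rows:
# assumes the rows being compared share the same ID; with several IDs,
# apply the same idea per group
diff_cols <- names(dat1)[sapply(dat1, data.table::uniqueN) > 1]
dat1[, ..diff_cols]
#    observationRep observationVal setSource
# 1:              1           17.6      XF47
# 2:              2            1.9      LT12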

How do I wrangle messy, raw data and import into R?

I have raw, messy data for time series containing around 1400 observations. Here is a snippet of what it looks like:
[new Date('2021-08-24'),1.67,1.68,0.9,null],[new Date('2021-08-23'),1.65,1.68,0.9,null],[new Date('2021-08-22'),1.62,1.68,0.9,null] ... etc
I want to pull the date and its respective value to form a tsibble in R. So, from the above values, it would be like
Date        y-variable
2021-08-24  1.67
2021-08-23  1.65
2021-08-22  1.62
Notice how only the first value is to be paired with its respective date - I don't need the other values. Right now, the raw data has been copied and pasted into a word document and I am unsure about how to approach data wrangling to import into R.
How could I achieve this?
#replace the text connection with a file connection if desired; the file should then be a txt file
input <- readLines(textConnection("[new Date('2021-08-24'),1.67,1.68,0.9,null],[new Date('2021-08-23'),1.65,1.68,0.9,null],[new Date('2021-08-22'),1.62,1.68,0.9,null]"))
#insert line breaks
input <- gsub("],[", "\n", input, fixed = TRUE)
#remove "new Date"
input <- gsub("new Date", "", input, fixed = TRUE)
#remove parentheses and brackets
input <- gsub("[\\(\\)\\[\\]]", "", input, perl = TRUE)
#import cleaned data
DF <- read.csv(text = input, header = FALSE, quote = "'")
DF$V1 <- as.Date(DF$V1)
print(DF)
# V1 V2 V3 V4 V5
#1 2021-08-24 1.67 1.68 0.9 null
#2 2021-08-23 1.65 1.68 0.9 null
#3 2021-08-22 1.62 1.68 0.9 null
How about this?
text <- "[new Date('2021-08-24'),1.67,1.68,0.9,null],[new Date('2021-08-23'),1.65,1.68,0.9,null],[new Date('2021-08-22'),1.62,1.68,0.9,null]"
df <- read.table(text = unlist(strsplit(gsub('new Date\\(|\\)', '', gsub('^.(.*).$', '\\1', text)), "].\\[")), sep = ",")
> df
V1 V2 V3 V4 V5
1 2021-08-24 1.67 1.68 0.9 null
2 2021-08-23 1.65 1.68 0.9 null
3 2021-08-22 1.62 1.68 0.9 null
Changing the column names and removing the extra columns is trivial from this point.
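Since the goal is a tsibble of just the date and the first value, one hedged final step (assuming the tsibble package is acceptable) could be:
library(dplyr)
library(tsibble)
# keep the date and the first value column, then index the tsibble by Date
df %>%
  transmute(Date = as.Date(V1), y = V2) %>%
  as_tsibble(index = Date)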

Remove spaces and convert values to numeric when cells contain only numbers in R

I've been searching the web for a solution to this problem.
What makes my case different from the usual ones is that my columns contain other values besides plain numbers.
Say for example:
df <- data.frame('Col1' = c('421', ' 0.52', '-0.88 ', '1.2 (ref)', ' 97 '),
                 'Col2' = c('0.0', '0.27,0.91', '3.0', ' 10242.3', ' 94.5'))
I would like to remove spaces from the cells composed only of numbers. I am not sure whether, for example, the dot in 0.52 still lets the cell count as a number, or the - in -0.88.
So far I would use
library(stringr)
# Remove spaces
df$Col1 <- str_replace_all(df$Col1, "\\s+", "")
library(dplyr)
# Convert to as.numeric
df %>%
  mutate_all(funs(as.numeric(as.character(.))))
But I would not like to just replace every single space; for example, in the value 1.2 (ref) I would like to keep that space. Likewise, I don't want to convert every value with as.numeric, only the pure numbers, i.e. those matching \d+\.\d+ or -\d+\.\d+ (regex).
Also, when I attempt to convert with as.numeric, the numeric values somehow change drastically; I understand this is because of the spaces present in the values.
Thanks in advance
You have several issues, as pointed out by akrun and Henrik: since all values in a data frame column must be of the same class, the 1.2 (ref) value forces the column to be of class character. Also, Col2 contains this entry: 0.27,0.91. This looks like two values, and you need to decide how to deal with it.
Suggestions: split Col1 into two columns, one holding the numeric values and the other holding the value ref or NA (this can be a character or factor column). As for the double numeric value: split it into two columns, or decide which value you would like to retain.
Under these assumptions your code could be something like this (using the tidyverse approach):
library(dplyr)
library(tidyr)
library(stringr)
df <- data.frame('Col1' = c('421', ' 0.52', '-0.88 ', '1.2 (ref)', ' 97 '),
'Col2' = c('0.0', '0.27,0.91', '3.0', ' 10242.3', ' 94.5'))
df <- df %>%
  mutate_all(.funs = funs(str_trim)) %>%                                    # remove leading and trailing spaces
  separate(col = Col1, into = c("Value_1", "Reference"), sep = "\\s|,") %>% # split into 2 columns at comma or space
  separate(col = Col2, into = c("Value_2", "Value_3"), sep = "\\s|,") %>%   # split into 2 columns at comma or space
  mutate_at(.vars = vars(starts_with("Value")), as.numeric)                 # convert character to numeric
This code does not scale well: if your dataset had many columns and each column required a different split, things would get complicated. Better to review your dataset first and do some quality control on it. If any column can contain comma-separated values, you can write code to catch that and apply a correction in a uniform way. Combinations of values and text are something you should not allow in your dataset.
Output:
> glimpse(df)
Observations: 5
Variables: 4
$ Value_1 <dbl> 421.00, 0.52, -0.88, 1.20, 97.00
$ Reference <chr> NA, NA, NA, "(ref)", NA
$ Value_2 <dbl> 0.00, 0.27, 3.00, 10242.30, 94.50
$ Value_3 <dbl> NA, 0.91, NA, NA, NA
> df
Value_1 Reference Value_2 Value_3
1 421.00 <NA> 0.00 NA
2 0.52 <NA> 0.27 0.91
3 -0.88 <NA> 3.00 NA
4 1.20 (ref) 10242.30 NA
5 97.00 <NA> 94.50 NA
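As an aside, funs() and the mutate_all()/mutate_at() verbs are deprecated or superseded in current dplyr; a sketch of the same pipeline with across() (starting again from the raw df, the behaviour should be identical):
df <- df %>%
  mutate(across(everything(), str_trim)) %>%
  separate(col = Col1, into = c("Value_1", "Reference"), sep = "\\s|,") %>%
  separate(col = Col2, into = c("Value_2", "Value_3"), sep = "\\s|,") %>%
  mutate(across(starts_with("Value"), as.numeric))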
I built a function using a regex:
library(tidyverse)
mClean <- function(strVec){
  strVec %>%
    str_trim() %>%
    str_extract("(?x)          # free-spacing mode: whitespace and comments in the pattern are ignored
                ^[\\+\\-]?     # an optional leading +/-
                \\d+           # the integer part
                (\\.\\d+)?     # an optional fractional part
                ") %>%
    as.numeric()
}
I put your data in a tibble and ran it:
df <- tibble('Col1' = c('421', ' 0.52', '-0.88 ', '1.2 (ref)', ' 97 '),
             'Col2' = c('0.0', '0.27,0.91', '3.0', ' 10242.3', ' 94.5')) %>%
  mutate(cln1 = as.numeric(mClean(Col1)),
         cln2 = as.numeric(mClean(Col2)))
df
# A tibble: 5 x 4
Col1 Col2 cln1 cln2
<chr> <chr> <dbl> <dbl>
1 421 0.0 421 0
2 " 0.52" 0.27,0.91 0.52 0.27
3 "-0.88 " 3.0 -0.88 3
4 1.2 (ref) " 10242.3" 1.2 10242.
5 " 97 " " 94.5" 97 94.5
I wasn't sure what you wanted done with that '0.27,0.91'. Break it into two rows? Make another column for the '0.91'? Anyway, this keeps the original input in the same row as the cleaned up values.
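If the comma-separated pair should instead become two rows, one hedged option is tidyr::separate_rows(), applied to the raw tibble before the cleaning step:
library(tidyr)
# expand '0.27,0.91' into two rows (run this before adding the cln columns)
df_long <- df %>% separate_rows(Col2, sep = ",")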

R: Aggregating over several variables and observations (depending on values) and creating a new variable

The data set has the following structure
Key Date Mat Amount
<int> <date> <chr> <dbl>
1 1001056 2014-12-12 10025 0.10
2 1001056 2014-12-23 10025 0.20
3 1001056 2015-01-08 10025 0.10
4 1001056 2015-04-07 10025 0.20
5 1001056 2015-05-08 10025 0.20
6 1001076 2013-10-29 10026 3.00
7 1001140 2013-01-18 10026 0.72
8 1001140 2013-04-11 10026 2.40
9 1001140 2014-10-08 10026 0.24
10 1001237 2015-02-17 10025 2.40
11 1001237 2015-02-17 10026 3.40
Mat takes values in {10001,...,11000}, hence A:=|Mat|=1000.
I would like to accomplish the following goals:
1) (Intermediate step) For each Key-Date combination I would like to calculate, for all materials available at that combination (which might vary from key to key), the differences in Amount,
e.g. for the combination "1001237 2015-02-17" this would be, for materials 10025 and 10026, 2.40 - 3.40 = -1 (but there might be more combinations). (How can those values be stored efficiently?)
This step might be skipped.
2) Finally, I would like to construct a new matrix of dimension A = 1000, where each entry (i, j) (for the combination of materials i and j) contains the average of the values calculated in the previous step.
More formally, entry (i, j) is given by
entry(i, j) = 1/|C_ij| * sum over all c in C_ij of (Amount_i - Amount_j),
where C_ij is the set of all Key-Date combinations containing both Mat i and Mat j.
As the table is quite large efficiency of the computation is very important.
Thank you very much for your help in advance!
I can do it with list columns in tidyverse; the trick is to use group_by to get distinct combinations of Key and Date. Here's the code:
materials <- unique(x$Mat)
n <- length(materials)
x <- x %>%
  group_by(Key, Date) %>%
  nest() %>%
  # Create an n by n matrix for each combination of Key and Date
  mutate(matrices = lapply(data,
    function(y) {
      out <- matrix(nrow = n, ncol = n,
                    dimnames = list(materials, materials))
      # Only fill in when the pair of materials is present
      # for the date of interest
      mat_present <- as.character(unique(y$Mat))
      for (i in mat_present) {
        for (j in mat_present) {
          # You may want to take an absolute value
          out[i, j] <- y$Amount[y$Mat == i] - y$Amount[y$Mat == j]
        }
      }
      out
    }))
If you really want speed, you can implement the function in lapply with Rcpp. You can use RcppParallel to further speed it up. Now one of the columns of the data frame is a list of matrices. Then, for each element of the matrices, take an average while ignoring NAs:
x_arr <- array(unlist(x$matrices), dim = c(2,2,10))
results <- apply(x_arr, 2, rowMeans, na.rm = TRUE)
I stacked the list of matrices into a 3D array and found row means slice by slice. For performance, you can also do it in RcppArmadillo, with sum(x_arr, 2), but it's hard to deal with missing values when not all types of materials are represented in a combination of Key and Date.
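If the dimensions are not known in advance, a hedged generalization derives them from the data instead of hard-coding c(2,2,10):
# n materials, one slice per Key-Date group; averaging over margins c(1, 2)
# computes the mean of each (i, j) cell across all slices in one call
x_arr <- array(unlist(x$matrices), dim = c(n, n, nrow(x)))
results <- apply(x_arr, c(1, 2), mean, na.rm = TRUE)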

Selecting rows or columns by data type?

Would it be possible to search for a column or row that matches a given type, such as numeric or POSIXct?
For example, if you had a table like so:
arizona.trees:
group redwoods diameter date
A 23 2.19 2017-8-20 08:12:56
A 24 3.14 2017-8-22 08:15:54
B 9 5.16 2017-8-20 08:15:40
B 10 8.99 2017-8-21 18:15:45
C 88 7.30 2017-8-23 23:55:55
Would it be possible to search for all columns of type POSIXct, which would return the date column?
You can get the names of columns of a particular data type with
names(arizona.trees)[sapply(arizona.trees, is, "numeric")]
names(arizona.trees)[sapply(arizona.trees, is, "POSIXt")]
If you want to do something with those columns, the dplyr library has the mutate_if/summarize_if/select_if verbs:
arizona.trees %>% select_if(is.numeric)
arizona.trees %>% summarize_if(is.numeric, mean)
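In current dplyr the *_if verbs are superseded; the same operations can be written with across() and where():
arizona.trees %>% select(where(is.numeric))
arizona.trees %>% summarize(across(where(is.numeric), mean))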
