Extracting the numbers from the data frame - r

I have a data frame with a "Calculation" column, which could be reproduced by the following code:
a <- data.frame(Id = c(1:3), Calculation = c('[489]/100','[4771]+[4777]+[5127]+[5357]+[5597]+[1044])/[463]','[1044]/[463]'))
> str(a)
'data.frame': 3 obs. of 2 variables:
$ Id : int 1 2 3
$ Calculation: Factor w/ 3 levels "[1044]/[463]",..: 3 2 1
Please note that there are two types of numbers in the "Calculation" column: most of them are surrounded by brackets, but some (in this case the number 100) are not (this has a meaning in my application).
What I would like to do is extract all the distinct numbers that appear in the Calculation column and return a vector with the union of these numbers. Ideally, I would like to be able to distinguish between the numbers that are between brackets and the numbers that are not. This step is not so important (if it complicates things), since the numbers that are NOT between brackets are few and I can detect them manually. So the desired output in this case would be:
b = c(489,4771,4777,5127,5357,5597,1044,463)
Thanks in advance

We can use str_extract_all from library(stringr). Using a regex lookbehind ((?<=\\[)), we match the digits (\\d+) that are preceded by [, extract them into a list, unlist it to get a vector, convert the character values to numeric (as.numeric), and keep the unique elements.
library(stringr)
unique(as.numeric(unlist(str_extract_all(a$Calculation, '(?<=\\[)\\d+'))))
#[1] 489 4771 4777 5127 5357 5597 1044 463
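For the optional part of the question (telling bracketed numbers apart from bare ones), here is a minimal base R sketch. It assumes a bare number is one that is not directly preceded by [ and not directly followed by ]. The names calc, bracketed and unbracketed are illustrative only, and the input is a plain character vector rather than the factor column.

```r
# Sample data from the question, as a plain character vector
calc <- c('[489]/100',
          '[4771]+[4777]+[5127]+[5357]+[5597]+[1044])/[463]',
          '[1044]/[463]')

# Small helper: all distinct numeric matches of a PCRE pattern
extract_numbers <- function(x, pattern) {
  hits <- regmatches(x, gregexpr(pattern, x, perl = TRUE))
  unique(as.numeric(unlist(hits)))
}

# Digits enclosed in [ ]
bracketed <- extract_numbers(calc, '(?<=\\[)\\d+(?=\\])')

# Digits neither preceded by [ nor followed by ];
# the \\d inside the lookarounds stops partial matches inside a longer number
unbracketed <- extract_numbers(calc, '(?<![\\[\\d])\\d+(?![\\]\\d])')

bracketed
unbracketed
```

With the sample data, bracketed holds the eight numbers from the desired output and unbracketed holds only 100.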

Related

How to find vectors with duplicate values in a row?

I have a lot of vectors, which looks something like this:
a <- c(0,0,0,1,1)
b <- c(1,0,0,0,1)
c <- c(0,0,1,1,1)
Each of these vectors contains a value that is repeated three times in succession.
I need to identify such repetitions. The key condition is that the repeated values directly follow one another.
duplicated() will not help, at least not in base R.
I need to detect such vectors so that I can then remove them.
A suitable vector for my work:
d <- c(1,0,1,0,0)
An unsuitable vector:
e <- c(1,1,1,0,0)
You might want to take a look at the rle from the base package or the rleid function from data.table.
rle(c(0,0,0,1,1))
Run Length Encoding
lengths: int [1:2] 3 2
values : num [1:2] 0 1
library(data.table)
rleid(c(0,0,0,1,1))
[1] 1 1 1 2 2
Both will look at runs of the same number. The rle function returns a list of lengths and values, and the rleid function returns a vector counting up each time the number in the series changes.
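Building on rle, a vector can be flagged when any of its run lengths reaches three. This is only a sketch: has_run and the threshold argument n are hypothetical names, and the list vecs simply collects vectors from the question.

```r
# TRUE if x contains at least n identical values in a row
has_run <- function(x, n = 3) {
  any(rle(x)$lengths >= n)
}

vecs <- list(
  a = c(0, 0, 0, 1, 1),
  d = c(1, 0, 1, 0, 0),
  e = c(1, 1, 1, 0, 0)
)

# Flag each vector, then keep only those without a run of three
flagged <- vapply(vecs, has_run, logical(1))
suitable <- vecs[!flagged]

names(suitable)
```

Here only d survives, which matches the "suitable" and "unsuitable" examples in the question.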

List all possible occurrences within a column?

I am trying to merge a data.frame and a column from another data.frame, but have so far been unsuccessful.
My first data.frame [Frequencies] consists of 2 columns, containing 47 upper/lower-case alphabetic characters and their frequency in a bigger data set. For example purposes:
Character<-c("A","a","B","b")
Frequency<-c(100,230,500,420)
The second data.frame [Sequences] is 93,000 rows in length and contains 2 columns, with the same 47 upper/lower-case alphabetic characters and a corresponding qualitative description. For example:
Character<-c("a","a","b","A")
Descriptor<-c("Fast","Fast","Slow","Stop")
I wish to add the descriptor column to the [Frequencies] data.frame, but not the 93,000 rows! Rather, what each "Character" represents. For example:
Character<-c("a")
Frequency<-c("230")
Descriptor<-c("Fast")
The following can also be done (adf and bdf presumably being the Frequencies and Sequences data frames):
> merge(adf, bdf[!duplicated(bdf$Character),])
Character Frequency Descriptor
1 a 230 Fast
2 A 100 Fast
3 b 420 Stop
4 B 500 Slow
Why not:
df1$Descriptor <- df2$Descriptor[ match(df1$Character, df2$Character) ]
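A self-contained version of the match approach, using the sample vectors from the question (df1 and df2 stand in for Frequencies and Sequences). match returns the position of the first occurrence only, so repeated characters in the long table do no harm, and characters with no match get NA.

```r
# Frequencies: one row per character
df1 <- data.frame(Character = c("A", "a", "B", "b"),
                  Frequency = c(100, 230, 500, 420),
                  stringsAsFactors = FALSE)

# Sequences: many rows, characters repeat
df2 <- data.frame(Character = c("a", "a", "b", "A"),
                  Descriptor = c("Fast", "Fast", "Slow", "Stop"),
                  stringsAsFactors = FALSE)

# Look up the first matching row of df2 for every character in df1
df1$Descriptor <- df2$Descriptor[match(df1$Character, df2$Character)]

df1
```

With this sample data, "B" does not occur in df2, so its Descriptor is NA.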

How can I specify which columns to select using read.table in R

I have a dataset with 100 columns and it doesn't have a header.
I have an integer vector of column numbers ranging from 1 to 100, for example a vector with "2 5 62 78".
Now when I read the dataset using read.table, all I want is to select column 2, 5, 62 and 78 from the dataset. How can I do that? Many thanks.
What you want is the option colClasses of read.table() (and the derivative functions). It allows you to pass a character vector with the classes of each column in the data. If you set that to "NULL" the column will be skipped. You can set the whole thing to "NULL" and then only change the ones you want to import (based on their class).
Proof of concept below.
cc <- rep('NULL', 100) ## skip all 100 columns
cc[c(2, 5)] <- 'integer' ## 2 and 5 are integer
cc[c(62, 78)] <- 'character' ## 62 and 78 will be imported as character
df <- read.csv('really-wide-data.csv', header=FALSE, colClasses=cc) ## the file has no header row
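If the classes of the wanted columns are not known in advance, the index vector can drive colClasses directly: entries left as NA are type-guessed by read.table, while "NULL" entries are skipped. The sketch below uses a made-up five-column input passed through the text argument so that it runs without a file; keep, n_cols and csv_text are illustrative names.

```r
keep   <- c(2, 4)   # columns to import (the question's vector would be c(2, 5, 62, 78))
n_cols <- 5         # total number of columns (100 in the question)

cc <- rep("NULL", n_cols)  # skip everything by default
cc[keep] <- NA             # NA = let read.table guess the type

# Tiny stand-in for the real file, which has no header
csv_text <- "1,a,3,4.5,x\n2,b,6,7.5,y"

df <- read.table(text = csv_text, sep = ",", header = FALSE, colClasses = cc)

str(df)
```

For a real file, replace text = csv_text with the file path.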

Extracting consecutive occurrences in R (like unix uniq)

I'm beginning to analyse data for my thesis. I first need to count consecutive occurrences of strings as one. Here's a sample vector:
test <- c("vv","vv","vv","bb","bb","bb","","cc","cc","vv","vv")
I would like to simply extract unique values, as in the unix command uniq. So expected output would be a vector as :
"vv","bb","cc","vv"
I looked at the rle function, which seems to be fine, but how would I get the output of rle as a vector? I don't seem to understand the rle class...
> rle(test)
Run Length Encoding
lengths: int [1:5] 3 3 1 2 2
values : chr [1:5] "vv" "bb" "" "cc" "vv"
How can I get one vector of the values output by rle and another one for the lengths? I hope I'm making myself clear...
Thanks again for any help!
rle() returns a two-element list of class "rle"; as @gsk points out, you can use ordinary list-indexing constructs to access the component vectors.
Also, try this, to put the results of rle into a more familiar format:
as.data.frame(rev(unclass(rle(test))))
# values lengths
# 1 vv 3
# 2 bb 3
# 3 1
# 4 cc 2
# 5 vv 2
Source: http://www.sigmafield.org/2009/09/22/r-function-of-the-day-rle
Solution: rle(test)$values
They use coin.rle <- rle(coin) and then coin.rle$values, so rle(test)$values should work.
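Putting this together for the question's vector: both components can be pulled out with $. Note that rle(test)$values still contains the empty string, which the expected output omits, so the sketch below drops it with nzchar (that filtering step is an assumption about what the asker wants).

```r
test <- c("vv", "vv", "vv", "bb", "bb", "bb", "", "cc", "cc", "vv", "vv")

runs <- rle(test)

run_values  <- runs$values   # one entry per run
run_lengths <- runs$lengths  # how long each run is

# Drop the run of empty strings to match the expected output
uniq_like <- run_values[nzchar(run_values)]

uniq_like
```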

Reading csv file, having numbers and strings in one column

I am importing a 3 column CSV file. The final column is a series of entries which are either an integer, or a string in quotation marks.
Here are a series of example entries:
1,4,"m"
1,5,20
1,6,"Canada"
1,7,4
1,8,5
When I import this using read.csv, these are all just turned into factors.
How can I set it up such that these are read as integers and strings?
Thank you!
This is not possible, since a given vector can only have a single mode (e.g. character, numeric, or logical).
However, you could split the vector into two separate vectors, one with numeric values and the second with character values:
vec <- c("m", 20, "Canada", 4, 5)
vnum <- as.numeric(vec)
vchar <- ifelse(is.na(vnum), vec, NA)
vnum
[1] NA 20 NA 4 5
vchar
[1] "m" NA "Canada" NA NA
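The same split can be applied straight after reading the file. The following is a sketch under the assumption that keeping the column as character (stringsAsFactors = FALSE) and adding two helper columns is acceptable; V3_num and V3_chr are made-up names, and the text argument stands in for the real CSV file.

```r
csv_text <- '1,4,"m"\n1,5,20\n1,6,"Canada"\n1,7,4\n1,8,5'

# Read the mixed column as character instead of factor
dat <- read.csv(text = csv_text, header = FALSE, stringsAsFactors = FALSE)

# Entries that do not parse as numbers become NA (the coercion warning is expected)
num <- suppressWarnings(as.numeric(dat$V3))

dat$V3_num <- num                             # numeric entries, NA elsewhere
dat$V3_chr <- ifelse(is.na(num), dat$V3, NA)  # string entries, NA elsewhere

dat
```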
EDIT Despite the OP's decision to accept this answer, @Andrie's answer is the preferred solution. My answer is meant only to inform about some odd features of data frames.
As others have pointed out, the short answer is that this isn't possible. data.frames are intended to contain columns of a single atomic type. @Andrie's suggestion is a good one, but just for kicks I thought I'd point out a way to shoehorn this type of data into a data.frame.
You can convert the offending column to a list (this code assumes you've set options(stringsAsFactors = FALSE)):
dat <- read.table(textConnection("1,4,'m'
1,5,20
1,6,'Canada'
1,7,4
1,8,5"),header = FALSE,sep = ",")
tmp <- as.list(as.numeric(dat$V3))
tmp[c(1,3)] <- dat$V3[c(1,3)]
dat$V3 <- tmp
str(dat)
'data.frame': 5 obs. of 3 variables:
$ V1: int 1 1 1 1 1
$ V2: int 4 5 6 7 8
$ V3:List of 5
..$ : chr "m"
..$ : num 20
..$ : chr "Canada"
..$ : num 4
..$ : num 5
Now, there are all sorts of reasons why this is a bad idea. For one, lots of code that you'd expect to play nicely with data.frames will not like this and either fail, or behave very strangely. But I thought I'd point it out as a curiosity.
No. A data frame is a series of vectors pasted together (a list of vectors or matrices). Because each column is a single vector, it cannot be classified as both integer and factor; it must be one or the other. You could split the vector into a numeric part and a factor part (a column for each), but I don't believe this is what you want.
