Splitting columns that contain decimal point values in R

I am facing difficulties splitting columns in R. For instance:
Col1.Col2.Col3
12.3,10,11
11.3,11,50
85,89.3,90
and over 100 more records.
I did
tidyr::separate(df, Col1.Col2.Col3,
                c("Col1", "Col2", "Col3"))
And I get:
Col1 Col2 Col3
12 3 10
11 3 11
85 89 3
and over 100 more records.
I realised that the decimal values were moved into the next column and the values of Col3 were dropped. How can I fix this, or is there a better way of splitting the columns?

tidyr::separate() has a sep argument that controls where the splits occur. By default it splits on every non-alphanumeric character, which is why the decimal points are treated as separators too. Use sep = ",".
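For example, reconstructing the sample above as a one-column data frame (the exact form of the input column is an assumption based on the pasted text):
df <- data.frame(Col1.Col2.Col3 = c("12.3,10,11", "11.3,11,50", "85,89.3,90"),
                 stringsAsFactors = FALSE)
tidyr::separate(df, Col1.Col2.Col3, into = c("Col1", "Col2", "Col3"),
                sep = ",", convert = TRUE)  # convert = TRUE restores numeric types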

Related

Adding a new column to one dataframe using a column from a second consisting of vectors with strings in R

I'm trying to add a new column to a dataframe; it should be filled from column A of a second dataframe, depending on column B. The values in column B are vectors. The vectors in my real data are far too long for an if_else statement, so I'm looking for something more along the lines of stringr or grepl recognizing character strings but cycling through the rows of the dataframe like a for loop.
df1 <- data.frame(col1 = c('siteA', 'siteB', 'siteC', 'siteD'),
                  col2 = c('ecoA', 'ecoB', 'ecoC', 'ecoD'))
df2 <- data.frame(colA = c('type1', 'type2'),
                  colB = c("c('ecoA','ecoC')", "c('ecoB','ecoD')"))
I've tried merge, mutate with if statements, joins (dplyr), and case_when, but these either didn't fill all of the rows or were far too long/complicated for the data set I have.
This is the end result I'm hoping for:
|col1 | col2 |colA |
|-----|------|-----|
|siteA| ecoA |type1|
|siteB| ecoB |type2|
|siteC| ecoC |type1|
|siteD| ecoD |type2|
You could parse each colB character string to transform it into a vector of character strings and then unnest it:
df2$colB <- lapply(df2$colB, function(x) eval(parse(text = x)))
df2 <- tidyr::unnest(df2, cols = colB)
dplyr::inner_join(df1, df2, by = c(col2 = "colB"))
col1 col2 colA
1 siteA ecoA type1
2 siteB ecoB type2
3 siteC ecoC type1
4 siteD ecoD type2
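If you would rather avoid eval(parse()), the same list column can be built by stripping the c('...') wrapper and splitting on the commas. This is only a sketch and assumes colB always has exactly the format shown above:
df2 <- data.frame(colA = c('type1', 'type2'),
                  colB = c("c('ecoA','ecoC')", "c('ecoB','ecoD')"))
df2$colB <- strsplit(gsub("c\\(|\\)|'|\\s", "", df2$colB), ",")  # list column of vectors
df2 <- tidyr::unnest(df2, cols = colB)
dplyr::inner_join(df1, df2, by = c(col2 = "colB"))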

Nested for loops, different in R

d3:
Col1 Col2
PBR569 23
PBR565 22
PBR565 22
PBR565 22
I am using this loop:
for (i in 1:(nrow(d3) - 1)) {
  for (j in (i + 1):nrow(d3)) {
    if (c(i) == c(j)) {
      print(c(j))
      # d4 <- subset.data.frame(c(j))
    }
  }
}
I want to compare all the rows in Col1 and eliminate the ones that are not the same. Then I want to output a data frame with only the ones that have the same values in col1.
Expected Output:
Col1 Col2
PBR565 22
PBR565 22
PBR565 22
Not sure what's up with my nested loop. Is it because I don't specify the column names?
The OP has requested to compare all the rows in Col1 and eliminate the ones that are not the same.
If I understand correctly, the OP wants to remove all rows where the value in Col1 appears only once and to keep only those rows where the value appears two or more times.
This can be accomplished by finding duplicated values in Col1. The duplicated() function marks the second and subsequent appearances of a value as duplicated. Therefore, we need to scan forward and backward and combine both results:
d3[duplicated(d3$Col1) | duplicated(d3$Col1, fromLast = TRUE), ]
Col1 Col2
2 PBR565 22
3 PBR565 22
4 PBR565 22
The same can be achieved by counting the appearances using the table() function as suggested by Ryan. Here, the counts are filtered to keep only those entries which appear two or more times.
t <- table(d3$Col1)
d3[d3$Col1 %in% names(t)[t >= 2], ]
Please note that this is different from Ryan's solution, which keeps only the rows whose value appears most often; only one value is picked, even in case of ties. (For the given small sample dataset both approaches return the same result.)
Ryan's answer can be re-written in a slightly more concise way:
d3[d3$Col1 == names(which.max(t)), ]
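For completeness, the same keep-the-duplicates logic can also be written with dplyr (a sketch, not taken from any of the answers above):
library(dplyr)
d3 %>%
  group_by(Col1) %>%
  filter(n() >= 2) %>%   # keep values that appear at least twice
  ungroup()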
Data
d3 <- data.table::fread(
"Col1 Col2
PBR569 23
PBR565 22
PBR565 22
PBR565 22", data.table = FALSE)

R custom parser function

I have data in txt file in this form:
col1 col2 col3 col4 col5
44 PT-222 My name is John 829302 24.02.14 01.53.51.000000 AM
11 PT-111 This is not user 8292829 24.02.14 01.40.47.000000 AM
I want to stress that these columns are not tab-separated; they are separated by one or more spaces. Also, col3 and col5 contain data that is composed of space-separated words.
Actually, the rows are not fixed length. To make it clear:
44 PT-222 My name 829302 24.02.14 01.53.51.000000 AM
1 PT-1 This is not user and John 829 24.02.14 01.40.47.000000 AM
How can I read that txt file into a table?
Is there a custom separator function that reads one line at a time, so that I can override it?
If the fields are fixed width you can use read.fwf. Otherwise, we can use read.pattern from the gsubfn package. (Below, text = Lines can be replaced with something like "myfile.dat".) First we read in the column names cn separately, since they are not in the same format as the data. Then we skip over the first two lines of the file, since the data begins on the third line, and we read in the data using an appropriate pattern, pat:
Lines <- "col1 col2 col3 col4 col5
44 PT-222 My name is John 829302 24.02.14 01.53.51.000000 AM
11 PT-111 This is not user 8292829 24.02.14 01.40.47.000000 AM"
library(gsubfn)
cn <- read.table(text = Lines, nrow = 1, as.is = TRUE)
pat <- "^ *(\\S+) +(\\S+) +(.*\\S) +(\\S+) +(\\S+ \\S+ \\S+) *$"
DF <- read.pattern(text = Lines, pattern = pat, skip = 2,
                   col.names = cn, as.is = TRUE)
giving:
> DF
col1 col2 col3 col4 col5
1 44 PT-222 My name is John 829302 24.02.14 01.53.51.000000 AM
2 11 PT-111 This is not user 8292829 24.02.14 01.40.47.000000 AM
Note that the pattern we used assumes that no fields are empty. Any rows that do not match the pattern are silently dropped. The skip=2 is optional since the first two rows would be ignored in any case since they do not match the pattern.
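A base-R alternative, if you prefer to avoid extra packages, is strcapture(), which captures the parenthesised groups of the same pattern into data frame columns. This is only a sketch; the column types in proto are assumptions based on the sample data:
body  <- strsplit(Lines, "\n")[[1]][-1]   # drop the header line
proto <- data.frame(col1 = integer(), col2 = character(), col3 = character(),
                    col4 = integer(), col5 = character(), stringsAsFactors = FALSE)
DF2 <- strcapture(pat, body, proto)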

Read fixed-width format, where the widths are inferred from the column headers

I have a rather odd file format that I need to read. It has space-separated columns, but the column widths must be inferred from the header.
In addition, there are some bogus lines that must be ignored, both blank and non-blank.
A representation of the data:
The first line contains some text that is not important and should be ignored.
The second line also. In addition, the third and fifth lines are blank.
col1 col2 col3 col4 col5
ab cd e 132399.4 101 0 17:25:24 Ignore anything past the last named column
blah 773411 25 10 17:25:25 Ignore this too
Here, the first column, col1, contains the text from the beginning of the line until the character position of the end of the text string col1. The second column, col2 contains the text from the next character following the 1 in col1 until the end of the text string col2. And so on.
In reality, there are 17 columns rather than 5, but that should not change the code.
I'm looking for a data frame with the contents:
col1 col2 col3 col4 col5
1 ab cd e 132399.4 101 0 17:25:24
2 blah 773411.0 25 10 17:25:25
Here is a rather inelegant approach:
read.tt <- function(file) {
  con <- base::file(file, 'r')
  readLines(con, n = 3)          # read and discard the three lines above the header
  header <- readLines(con, n = 1)
  close(con)
  # positions of the last character of each header word, used to derive the widths
  endpoints <- c(0L, gregexpr('[^ ]( |$)', header)[[1]])
  widths <- diff(endpoints)
  names <- sapply(seq_along(widths),
                  function(i) substr(header, endpoints[i] + 1, endpoints[i] + widths[i]))
  names <- sub('^ *', '', names)
  body <- read.fwf(file, widths, skip = 5)
  names(body) <- names
  body
}
There must be a better way.
The lines to be ignored are a minor piece of this puzzle. I'll accept a solution that works with these already removed from the file (but of course would prefer one that does not need preprocessing).
If you know your header line, you can get the widths using the following method.
x
## [1] " col1 col2 col3 col4 col5"
nchar(unlist(regmatches(x, gregexpr("\\s+\\S+", x))))
## [1] 13 9 5 5 10
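Tying that together with read.fwf() might look like the sketch below. The file name "mydata.txt" and the line positions (header on line 4, data starting on line 6, as described in the question) are assumptions, and the regex relies on each header word being preceded by at least one space, as in the example above:
lines  <- readLines("mydata.txt")
header <- lines[4]
fields <- unlist(regmatches(header, gregexpr("\\s+\\S+", header)))
dat    <- read.fwf("mydata.txt", widths = nchar(fields), skip = 5,
                   strip.white = TRUE)
names(dat) <- trimws(fields)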

R Using a for() loop to fill one dataframe with another

I have two dataframes and I wish to insert the values of one dataframe into another (let's call them DF1 and DF2).
DF1 consists of two columns. Column 1 (col1) contains the characters a to z, and col2 has a value associated with each character (from a to z).
DF2 is a dataframe with 3 columns. The first two consist of every combination of DF1$col1, so: aa, ab, ac, ad, etc., where the first letter is in col1 and the second letter is in col2.
I want to create a simple mathematical model utilizing the values in DF1$col2 to see the outcomes of every possible combination of objects in DF1$col1.
The first step I wanted to take is to transfer values from DF1$col2 to DF2$col3 (the values in DF2$col3 should be associated with the values in DF2$col1), but that's where I'm stuck. I currently have:
for (j in 1:length(DF2$col1))
{
  ## this part is to use the characters in DF2$col1 as an input
  ## to yield the output for DF2$col3
  input = c(DF2$col1)[j]
  ## This is supposed to use the values found in DF1$col2 to fill in DF2$col3
  g = DF1[(DF1$col2 == input), "pred"]
  ## This is so that the values will fill in DF2$col3
  DF2$col3 = g
}
When I run this, DF2$col3 will be filled up with the same value for a specific character from DF1 (e.g. DF2$col3 will have all the rows filled with the value associated with character "a" from DF1)
What exactly am I doing wrong?
Thanks a bunch for your time
You should really use merge for this, as @Aaron suggested in his comment above, but if you insist on writing your own loop, then the problem is in your last line: you assign g to the whole col3 column. You should use the j index there as well, like:
for (j in 1:length(DF2$col1))
{
  DF2$col3[j] = DF1[which(DF1$col2 == DF2$col1[j]), "pred"]
}
If this does not work out, then please also post some sample data so we can help in more detail (as I do not know, but can only guess, what "pred" might be).
It sounds like what you are trying to do is a simple join, that is, match DF1$col1 to DF2$col1 and copy the corresponding value from DF1$col2 into DF2$col3. Try this:
DF1 <- data.frame(col1=letters, col2=1:26, stringsAsFactors=FALSE)
DF2 <- expand.grid(col1=letters, col2=letters, stringsAsFactors=FALSE)
DF2$col3 <- DF1$col2[match(DF2$col1, DF1$col1)]
This uses the function match(), which, as the documentation states, "returns a vector of the positions of (first) matches of its first argument in its second." The values you have in DF1$col1 are unique, so there will not be any problem with this method.
As a side note, in R it is usually better to vectorize your work rather than using explicit loops.
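The merge() route mentioned in the first answer would look roughly like this with the same toy data (a sketch; renaming DF1's value column is just so the joined values end up as col3):
DF1 <- data.frame(col1 = letters, col2 = 1:26, stringsAsFactors = FALSE)
DF2 <- expand.grid(col1 = letters, col2 = letters, stringsAsFactors = FALSE)
DF2 <- merge(DF2, setNames(DF1, c("col1", "col3")), by = "col1", sort = FALSE)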
Not sure I fully understood your question, but you can try this:
df1 <- data.frame(col1=letters[1:26], col2=sample(1:100, 26))
df2 <- with(df1, expand.grid(col1=col1, col2=col1))
df2$col3 <- df1$col2
The last command uses recycling (it could have been written as rep(df1$col2, 26) as well).
The results are shown below:
> head(df1, n=3)
col1 col2
1 a 68
2 b 73
3 c 45
> tail(df1, n=3)
col1 col2
24 x 22
25 y 4
26 z 17
> head(df2, n=3)
col1 col2 col3
1 a a 68
2 b a 73
3 c a 45
> tail(df2, n=3)
col1 col2 col3
674 x z 22
675 y z 4
676 z z 17
