I want to separate a column which contains dates and items into two columns.
V1
23/2/2000shampoo
24/2/2000flour
21/10/2000poultry
17/4/2001laundry detergent
To this
V1 V2
23/2/2000 shampoo
24/2/2000 flour
21/10/2000 poultry
17/4/2001 laundry detergent
My problem is that there's no separation between the two. The date length isn't uniform (it's in the format of 1/1/2000 instead of 01/01/2000) so I can't separate by character length. The dataset also covers multiple years.
One option is separate() from tidyr. We specify sep with a regex lookaround to split between a digit and a lower-case letter:
library(dplyr)
library(tidyr)
df1 %>%
  separate(V1, into = c("V1", "V2"), sep = "(?<=[0-9])(?=[a-z])")
# V1 V2
#1 23/2/2000 shampoo
#2 24/2/2000 flour
#3 21/10/2000 poultry
#4 17/4/2001 laundry detergent
Or with read.csv, after creating a delimiter with sub:
read.csv(text = sub("(\\d)([a-z])", "\\1,\\2", df1$V1),
         header = FALSE, stringsAsFactors = FALSE)
data
df1 <- structure(list(V1 = c("23/2/2000shampoo", "24/2/2000flour",
"21/10/2000poultry",
"17/4/2001laundry detergent")), class = "data.frame", row.names = c(NA,
-4L))
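For comparison, a base R sketch using the same lookaround split (perl = TRUE is needed for the lookbehind), assuming df1 as defined above:
parts <- strsplit(df1$V1, "(?<=[0-9])(?=[a-z])", perl = TRUE)
# first piece is the date, second piece is the item
data.frame(V1 = sapply(parts, `[`, 1),
           V2 = sapply(parts, `[`, 2))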
You could also use capture groups with tidyr::extract(). The first group, \\d{1,2}/\\d{1,2}/\\d{4}, captures the date in the format you posted, and the second group, [[:print:]]+, grabs at least one printable character.
library(tidyverse)
df1 %>%
  extract(V1, c("V1", "V2"), "(\\d{1,2}/\\d{1,2}/\\d{4})([[:print:]]+)")
V1 V2
1 23/2/2000 shampoo
2 24/2/2000 flour
3 21/10/2000 poultry
4 17/4/2001 laundry detergent
Data:
df1 <- readr::read_csv("V1
23/2/2000shampoo
24/2/2000flour
21/10/2000poultry
17/4/2001laundry detergent")
You can also use str_split_fixed() from stringr:
data <- data.frame(V1 = c("23-02-2000shampoo", "24-02-2001flour"))
library(stringr)
str_split_fixed(data$V1, "(?<=[0-9])(?=[a-z])", 2)
[,1] [,2]
[1,] "23-02-2000" "shampoo"
[2,] "24-02-2001" "flour"
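If you want the two pieces back in the original data frame, you can assign them to new columns (the names date and item are just illustrative):
split_cols <- str_split_fixed(data$V1, "(?<=[0-9])(?=[a-z])", 2)
data$date <- split_cols[, 1]   # "23-02-2000", "24-02-2001"
data$item <- split_cols[, 2]   # "shampoo", "flour"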
I have a dataframe with species names plus V1, V4 or V9 (or just the species name) in one column, and another column with Order names that repeat down the column.
What I need is a way to count how many times "V1" in the first column occurs for each Order, and then the same for "V4" and "V9".
I've tried this for the V1 count:
countordens <- malardf %>%
  group_by(ordens) %>%
  summarise(V1 = (sum(str_count(malardf$malar_names, pattern = "V1"))))
But it returns the grouped Orders with the total count of "V1" across the whole dataframe, instead of the count of "V1" for each Order.
malar_names malaordens.Order
1 Protomima imitatrix V1 V9 Amphipoda
2 Caprella danilevskii V1 V9 Amphipoda
3 Caprella andreae Amphipoda
4 Caprella andreae Amphipoda
5 Caprella andreae Amphipoda
6 Caprella andreae Amphipoda
I'm hoping to get a dataframe with each Order appearing only once, a second column with the number of times "V1" occurs for that Order, and further columns for "V4" and "V9".
If we want to get the counts of multiple values, we can use map:
library(tidyverse)
map(c("V1", "V4", "V9"), ~
malardf %>%
group_by(malaordens.Order) %>%
summarise(!! .x := sum(str_count(malar_names,
pattern = .x)))) %>%
reduce(inner_join, by = "malaordens.Order")
# A tibble: 1 x 4
# malaordens.Order V1 V4 V9
# <chr> <int> <int> <int>
#1 Amphipoda 2 0 2
Note that the OP's issue when counting a single pattern also stems from extracting the whole column (malardf$) after the group_by. Within mutate/summarise there is no need to use data$; just pass the unquoted column name. That works the same with or without a grouping operation.
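For instance, the single-pattern count behaves as intended once the malardf$ prefix is dropped (column names as in the posted data):
library(dplyr)
library(stringr)
countordens <- malardf %>%
  group_by(malaordens.Order) %>%
  summarise(V1 = sum(str_count(malar_names, pattern = "V1")))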
data
malardf <- structure(list(malar_names = c("Protomima imitatrix V1 V9",
"Caprella danilevskii V1 V9",
"Caprella andreae", "Caprella andreae", "Caprella andreae", "Caprella andreae"
), malaordens.Order = c("Amphipoda", "Amphipoda", "Amphipoda",
"Amphipoda", "Amphipoda", "Amphipoda")), class = "data.frame", row.names = c("1",
"2", "3", "4", "5", "6"))
I have two data.tables in this format (the actual tables have about a million rows in each):
library(data.table)
dt1 <- data.table(
code=c("A001", "A002","A003","A004","A005"),
x=c(65,92,25,450,12),
y=c(98,506,72,76,15),
no1=c(010101, 010156, 028756, 372576,367383),
no2=c(876362,"",682973,78269,"")
)
dt2 <- data.table(
code=c("A003", "A004","A005","A006","A007","A008","A009"),
x=c(25,126,12,55,34,134,55),
y=c(72,76,890,568,129,675,989),
no1=c(028756, 372576,367383,234876, 287156, 123348, 198337),
no2=c(682973,78269,65378,"","","",789165)
)
I would like to combine the two and keep only the rows that are unique across all columns. This is what I have, but I assume there is a better way of doing it:
dt3 <- rbindlist(list(dt1, dt2))
dt3 <- unique(dt3, by = c("code", "x", "y", "no1", "no2"))
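As an aside, unique() on a data.table already uses all columns by default (in recent data.table versions), so an equivalent shorter form is:
dt3 <- unique(rbindlist(list(dt1, dt2)))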
Once I have this single dataset, I would like to give any duplicate 'code' records some attribute information (a version number and a comment about what is different in that version compared with the previous one). The output I am looking for would be this:
dt4 <- data.table(
code=c("A001", "A002","A003","A004","A005", "A004","A005","A006","A007","A008","A009"),
x=c(65,92,25,450,12,126,12,55,34,134,55),
y=c(98,506,72,76,15,76,890,568,129,675,989),
no1=c(010101, 010156, 028756, 372576,367383, 372576,367383,234876, 287156, 123348, 198337),
no2=c(876362,"",682973,78269,"",78269,65378,"","","",789165),
version = c("V1","V1","V1","V1","V1","V2","V2","V1","V1","V1","V1"),
unique_version=c("A001_V1", "A002_V1","A003_V1","A004_V1","A005_V1", "A004_V2","A005_V2","A006_V1","A007_V1","A008_V1","A009_V1"),
comment = c("First_entry","First_entry","First_entry","First_entry","First_entry","New_x", "New_y_and_no2","First_entry","First_entry","First_entry","First_entry")
)
I'm not sure how to achieve dt4 (efficiently, considering the real dataset will be over a million rows).
Edit
Having applied #Chase's solution to my real data, I noticed that my dt3 example differs slightly from what my real data looks like. This is closer to it:
dt6 <- data.table(
code=c("A111", "A111","A111","A111","A111", "A111","A111","A234", "A234","A234","A234","A234", "A234","A234"),
x=c("",126,126,"",836,843,843,126,126,"",127,836,843,843),
y=c("",76,76,"",456,465,465,76,76,"",77,456,465,465),
no1=c(028756, 028756,028756,057756, 057756, 057756, 057756,028756, 028756,057756,057756, 057756, 057756, 057756),
no2=c("","",034756,"","","",789165,"",034756,"","","","",789165)
)
comp_cols <- c("x", "y", "no1", "no2")
#grabs the names of the mismatching values and formats them how you did
f <- function(x, y) {
  n_x <- names(x)
  diff <- x != y
  paste0("New_", paste0(n_x[diff], collapse = "_and_"))
}
dt6[, version := paste0("V", 1:.N), by = code]
dt6[, unique_version := paste(code, version, sep = "_")]
dt6[, comment := ifelse(version == "V1", "First_entry", f(.SD[1], .SD[2])), by = code, .SDcols = comp_cols]
As you can see, the suggested solution for creating the comment column seems to return only the first change, between the first and second versions (and not the changes between V2 and V3, etc.).
Here's one solution; the first two columns are trivial, the comment takes a little more thought:
dt5 <- copy(dt3)
comp_cols <- c("x", "y", "no1", "no2")
#grabs the names of the mismatching values and formats them how you did
f <- function(x, y) {
  n_x <- names(x)
  diff <- x != y
  paste0("New_", paste0(n_x[diff], collapse = "_and_"))
}
dt5[, version := paste0("V", 1:.N), by = code]
dt5[, unique_version := paste(code, version, sep = "_")]
dt5[, comment := ifelse(version == "V1", "First_entry", f(.SD[1], .SD[2])), by = code, .SDcols = comp_cols]
Which ends up yielding this:
> dt5
code x y no1 no2 version unique_version comment
1: A001 65 98 10101 876362 V1 A001_V1 First_entry
2: A002 92 506 10156 V1 A002_V1 First_entry
3: A003 25 72 28756 682973 V1 A003_V1 First_entry
4: A004 450 76 372576 78269 V1 A004_V1 First_entry
5: A005 12 15 367383 V1 A005_V1 First_entry
6: A004 126 76 372576 78269 V2 A004_V2 New_x
7: A005 12 890 367383 65378 V2 A005_V2 New_y_and_no2
8: A006 55 568 234876 V1 A006_V1 First_entry
9: A007 34 129 287156 V1 A007_V1 First_entry
10: A008 134 675 123348 V1 A008_V1 First_entry
11: A009 55 989 198337 789165 V1 A009_V1 First_entry
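Regarding the edit: because f() always compares just the first two rows of a group, every later version gets that same comment. One possible fix (a sketch only, assuming dt6 and comp_cols as defined in the edit; g is just an illustrative helper name) is to compare each row with the row immediately before it within the group:
# build the comment per group by diffing consecutive rows
g <- function(sd) {
  cmt <- rep("First_entry", nrow(sd))
  if (nrow(sd) > 1) {
    for (i in 2:nrow(sd)) {
      changed <- unlist(sd[i]) != unlist(sd[i - 1])
      # rows identical to the previous one would end up as "New_"; handle as needed
      cmt[i] <- paste0("New_", paste(names(sd)[changed], collapse = "_and_"))
    }
  }
  cmt
}
dt6[, version := paste0("V", 1:.N), by = code]
dt6[, unique_version := paste(code, version, sep = "_")]
dt6[, comment := g(.SD), by = code, .SDcols = comp_cols]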
I want to read a text file into R, but the problem is that the row labels are merged with the first column of numbers.
Data text file
revenues 4118000000.0, 4315000000.0, 4512000000.0, 4709000000.0, 4906000000.0, 5103000000.0
cost_of_revenue-1595852945.4985902, -1651829192.2662954, -1705945706.6237037, -1758202488.5708148, -1808599538.1076286, -1857136855.234145
gross_profit 2522147054.5014095, 2663170807.7337046, 2806054293.376296, 2950797511.429185, 3097400461.892371, 3245863144.765855
R Code:
data.predicted_values = read.table("predicted_values.txt", sep=",")
Output:
V1 V2 V3 V4 V5 V6
1 revenues 4118000000.0 4315000000 4512000000 4709000000 4906000000 5103000000
2 cost_of_revenue-1595852945.4985902 -1651829192 -1705945707 -1758202489 -1808599538 -1857136855
3 gross_profit 2522147054.5014095 2663170808 2806054293 2950797511 3097400462 3245863145
How can I split the first column into two parts? That is, I want V1 to be revenues, cost_of_revenue, gross_profit, and V2 to be 4118000000.0, -1595852945.4985902, 2522147054.5014095, and so on.
This is along the same lines of thinking as #DWin's, but accounts for the negative values in the second row.
TEXT <- readLines("predicted_values.txt")
A <- gregexpr("[A-Za-z_]+", TEXT)
B <- read.table(text = regmatches(TEXT, A, invert = TRUE)[[1]], sep = ",")
C <- cbind(FirstCol = regmatches(TEXT, A)[[1]], B)
C
# FirstCol V1 V2 V3 V4 V5 V6
# 1 revenues 4118000000 4315000000 4512000000 4709000000 4906000000 5103000000
# 2 cost_of_revenue -1595852945 -1651829192 -1705945707 -1758202489 -1808599538 -1857136855
# 3 gross_profit 2522147055 2663170808 2806054293 2950797511 3097400462 3245863145
Since there are no commas between the row names and the values, you need to add them in:
txt <- "revenues 4118000000.0, 4315000000.0, 4512000000.0, 4709000000.0, 4906000000.0, 5103000000.0
cost_of_revenue-1595852945.4985902, -1651829192.2662954, -1705945706.6237037, -1758202488.5708148, -1808599538.1076286, -1857136855.234145
gross_profit 2522147054.5014095, 2663170807.7337046, 2806054293.376296, 2950797511.429185, 3097400461.892371, 3245863144.765855"
Lines <- readLines( textConnection(txt) )
# replace textConnection(.) with `file = "predicted_values.txt"`
res <- read.csv(text = sub("(^[[:alpha:][:punct:]]+)(\\s|-)",
                           "\\1,", Lines),
                header = FALSE, row.names = 1)
res
The decimal fractions may not print but they are there.
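For example, printing with more digits shows them:
print(res, digits = 16)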
You want the row.names argument of read.table. Then you can simply transpose your data:
data.predicted_values = read.table("predicted_values.txt", sep=",", row.names=1)
data.predicted_values <- t(data.predicted_values)
I am using prettyNum in combination with xtable to produce nice LaTeX tables. It works as expected when I have more than one row in the data.frame. But when I only have one row it fails because prettyNum converts the data.frame to a character vector.
Is there a simple way similar to "drop = FALSE" to retain a data.frame from prettyNum?
df <- data.frame(a = 1, b ='a')
prettyDf <- prettyNum(df)
class(prettyDf)
[1] "character"
There are two basic options: convert the output to a one-row matrix before using xtable, or use apply to run prettyNum row by row.
Here is a minimal example:
First, some sample data: a one-row data.frame and a three-row data.frame:
df <- structure(
list(V1 = 76491283764.9743, V2 = 29.12345678901, V3 = -7.1234, V4 = -100.1,
V5 = 1123), .Names = c("V1", "V2", "V3", "V4", "V5"),
row.names = c(NA, -1L), class = "data.frame")
df
# V1 V2 V3 V4 V5
# 1 76491283765 29.12346 -7.1234 -100.1 1123
df2 <- rbind(df, df*2, df*3)
df2
# V1 V2 V3 V4 V5
# 1 76491283765 29.12346 -7.1234 -100.1 1123
# 2 152982567530 58.24691 -14.2468 -200.2 2246
# 3 229473851295 87.37037 -21.3702 -300.3 3369
Using xtable on the 3-row data.frame works fine, though, unlike what you stated in your question, prettyNum does not retain a data.frame there either; it converts the output to a character matrix:
> class(prettyNum(df2, big.mark=","))
[1] "matrix"
> xtable(prettyNum(df2, big.mark=","), align = "rrrrrr")
% latex table generated in R 2.15.3 by xtable 1.7-1 package
% Fri Mar 22 15:46:08 2013
\begin{table}[ht]
\centering
\begin{tabular}{rrrrrr}
\hline
& V1 & V2 & V3 & V4 & V5 \\
\hline
1 & 76,491,283,765 & 29.12346 & -7.1234 & -100.1 & 1,123 \\
2 & 152,982,567,530 & 58.24691 & -14.2468 & -200.2 & 2,246 \\
3 & 229,473,851,295 & 87.37037 & -21.3702 & -300.3 & 3,369 \\
\hline
\end{tabular}
\end{table}
Trying the same on the 1-row data.frame gives us an error, because in the process of "prettifying" your numbers, the data.frame becomes a named character vector, as you already figured out:
> xtable(prettyNum(df, big.mark = ","), align = "rrrrrr")
Error in UseMethod("xtable") :
no applicable method for 'xtable' applied to an object of class "character"
Use methods(xtable) to see the types of objects for which xtable methods have been defined. To view a specific method, for example xtable.data.frame, use xtable:::xtable.data.frame. From there, you can learn how to write your own method if you think it's necessary.
However, in this case, it's not necessary. Just try one of the following options, noting the subtle difference between them:
xtable(matrix(prettyNum(df, big.mark = ","), nrow = 1), align = "rrrrrr")
xtable(t(apply(df, 1, prettyNum, big.mark = ",")), align = "rrrrrr")
Here is some sample output:
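A further option, if you would rather keep a data.frame throughout (which also sidesteps the one-row case), is to apply prettyNum column by column and assign the result back into the data.frame; a minimal sketch:
prettyDf <- df
prettyDf[] <- lapply(df, prettyNum, big.mark = ",")
class(prettyDf)
# [1] "data.frame"
xtable(prettyDf, align = "rrrrrr")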