I have a simple txt file with some data that I need to read using R.
My file contains these rows:
a, b, c, e
"1", €57,000.00, 5, 10FEB2015
"K", €0.00, 6, 15APR2016
"C", €1,444,055.00, 6, 15APR2016
As you can see, column b is a monetary value that contains a thousands separator (","), which is the same character as the data delimiter (sep=",").
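For example, a plain read goes wrong (just a sketch; assume the rows above are saved as data.txt):
df <- read.csv("data.txt", header = TRUE)
# The commas inside the euro amounts are taken as field separators,
# so the rows no longer have a consistent number of fields and the
# read either errors out or misaligns the columns.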
Sometimes you have to do it line by line:
library(stringi)
library(purrr)
lines <- 'a,b,c,e
"1",€57,000.00,5,10FEB2015
"K",€0.00,6,15APR2016
"C",€1,444,055.00,6,15APR2016'
dat <- readLines(textConnection(lines))
# we need the column names
cols <- stri_split_regex(dat[1], ",")[[1]]
# regular expression capture groups can do the hard work
map_df(stri_match_all_regex(dat[2:length(dat)],
                            '^"([[:alnum:]]+)",€([[:digit:],]+\\.[[:digit:]]{2}),([[:digit:]]+),(.*)$'),
       function(x) {
         setNames(rbind.data.frame(x[2:length(x)], stringsAsFactors = FALSE),
                  cols)
       }) -> df
# proper types
df$b <- as.numeric(stri_replace_all_regex(df$b, ",", ""))
df$e <- as.Date(df$e, "%d%b%Y")
str(df)
## Classes ‘tbl_df’, ‘tbl’ and 'data.frame': 3 obs. of 4 variables:
## $ a: chr "1" "K" "C"
## $ b: num 57000 0 1444055
## $ c: chr "5" "6" "6"
## $ e: Date, format: "2015-02-10" "2016-04-15" ...
I am trying to create a data frame in which some cells have a list of strings, while others have a single string. Ideally, from this data frame, I would then be able to extract all unique lists into a new list, vector, or one-row data frame. Any tips? Reprex below:
Data frame with some lists of strings within cells:
require(stringr)
table3 <- data.frame(U1 = I(list(c("b", "d"), c("d"), c(NA))),
                     U2 = I(list(c("a", "b", "d"), c("b"), c("b", "d"))),
                     U3 = I(list(c(99), c("a"), c("a"))),
                     U4 = I(list(c("a"), c(NA), c(NA))))
rownames(table3) <- c("C1", "C2", "C3")
What I want the output to look like:
table3.elem <- data.frame(C = I(list(99, "a", "b", "d", c("b","d"), c("a", "b", "d"))))
I'm ultimately trying to reproduce the calculations for Krippendorff's alpha for multi-valued data, as published in Krippendorff & Cragg (2016). Unfortunately, their downloadable Java program for this version of Krippendorff's alpha no longer runs on my computer, so I'm trying to build a version in R that at least I can use (and hopefully others can too, once I get it working).
Thank you!
An option is to:
Convert the data.frame into a list (unclass)
Flatten the list (do.call + c)
Get the unique list elements
Filter out the list elements that are NA
Create a data.frame with a list column
out <- data.frame(C = I(Filter(function(x) all(complete.cases(x)),
                               unique(do.call(c, unclass(table3))))))
out <- out[order(lengths(out$C), !sapply(out$C, is.numeric),
                 sapply(out$C, head, 1)), , drop = FALSE]
row.names(out) <- NULL
-output
> out
C
1 99
2 a
3 b
4 d
5 b, d
6 a, b, d
> str(out)
'data.frame': 6 obs. of 1 variable:
$ C:List of 6
..$ : num 99
..$ : chr "a"
..$ : chr "b"
..$ : chr "d"
..$ : chr "b" "d"
..$ : chr "a" "b" "d"
..- attr(*, "class")= chr "AsIs"
You can use unlist non-recursively, make the result unique, and remove the NAs to extract the unique lists.
x <- unique(unlist(table3, FALSE))
x <- x[!is.na(x)]
x <- x[order(lengths(x), sapply(x, paste, collapse= ""))] #In case it should be ordered
data.frame(C = I(x))
# C
#1 99
#2 a
#3 b
#4 d
#5 b, d
#6 a, b, d
I am using S3 methods in the following way.
First, I look for all the tasks common to every class and put that code only once, before the call to UseMethod. Then I program the rest in each class's method.
The problem arises when I modify an argument: the modification is lost, because the method is called with the arguments as they came in to the generic. Other tasks, such as checking arguments or defining sub-functions, work well in this scheme.
In the next example, I modify an argument:
secure_filter <- function(table, col, value){
  if(!is.numeric(table[[col]])) table[[col]] <- as.numeric(table[[col]])
  message("converting to numeric")
  print(str(table))
  UseMethod("secure_filter", table)
}

secure_filter.data.table <- function(table, col, value){
  return(table[col == value, ])
}

secure_filter.data.frame <- function(table, col, value){
  return(table[table$col == !!value, ])
}
and the result is wrong:
library(data.table)
df <- data.frame(a = c("a", "b", "c"), column = c("1", "2", "3"))
dt <- as.data.table(df)
> secure_filter(dt, "column", 1)
converting to numeric
Classes ‘data.table’ and 'data.frame': 3 obs. of 2 variables:
$ a : chr "a" "b" "c"
$ column: num 1 2 3
- attr(*, ".internal.selfref")=<externalptr>
NULL
Empty data.table (0 rows and 2 cols): a,column
> secure_filter(df, "column", 1)
converting to numeric
'data.frame': 3 obs. of 2 variables:
$ a : chr "a" "b" "c"
$ column: num 1 2 3
NULL
[1] a column
<0 rows> (or 0-length row.names)
Am I using S3 correctly? How do I avoid repeating code between S3 classes?
Is there an example in a well-known R function?
I am using this approach to re-program tidyverse scripts to data.table scripts.
Use NextMethod instead of UseMethod.
secure_filter <- function(table, col, value){
  if(!is.numeric(table[[col]])) table[[col]] <- as.numeric(table[[col]])
  message("converting to numeric")
  print(str(table))
  NextMethod("secure_filter")
  # UseMethod("secure_filter", table)
}
> secure_filter(dt, "column", 1)
converting to numeric
Classes ‘data.table’ and 'data.frame': 3 obs. of 2 variables:
$ a : chr "a" "b" "c"
$ column: num 1 2 3
- attr(*, ".internal.selfref")=<externalptr>
NULL
a column
1: a 1
> secure_filter(df, "column", 1)
converting to numeric
'data.frame': 3 obs. of 2 variables:
$ a : chr "a" "b" "c"
$ column: num 1 2 3
NULL
a column
1 a 1
But I don't know whether this answer is well behaved, because sloop doesn't show the dispatched method and can't find the generic.
> sloop::s3_dispatch(secure_filter(dt, "column", 1))
secure_filter.data.table
secure_filter.data.frame
secure_filter.default
> sloop::s3_get_method(secure_filter)
Error: Could not find generic
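For what it's worth, one conventional way to avoid repeating the shared coercion while keeping the generic a plain one-line UseMethod() call (which keeps dispatch easy to introspect) is to factor the common work into a helper that each method calls. This is only a sketch of that pattern, not the answer's approach; coerce_col is a made-up helper name:
library(data.table)

# Hypothetical helper holding the work shared by every method
coerce_col <- function(table, col) {
  if (!is.numeric(table[[col]])) {
    message("converting to numeric")
    table[[col]] <- as.numeric(table[[col]])
  }
  table
}

# The generic does nothing but dispatch
secure_filter <- function(table, col, value) UseMethod("secure_filter")

secure_filter.data.frame <- function(table, col, value) {
  table <- coerce_col(table, col)
  table[table[[col]] == value, , drop = FALSE]
}

secure_filter.data.table <- function(table, col, value) {
  table <- coerce_col(table, col)
  table[table[[col]] == value, ]
}
With that layout the shared conversion lives in one place, and sloop::s3_dispatch(secure_filter(dt, "column", 1)) should point at secure_filter.data.table.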
I have a 2-column data frame (DF) in which one column contains vectors and the other contains characters.
Orig.   Matched
AbcD    c("ab.d", "Acbd", "AA.D", "")
jKdf    c("JJf.", "K.dF", "JkD.", "")
My aim is to strip all the punctuation marks (commas and periods) as well as make everything lowercase. This is easy enough for the character column, but the vector column is more challenging.
Some lowercasing methods I tried are:
lapply(DF, tolower). This converts the data frame to a matrix, and in doing so I lose the column-of-vectors structure.
As for the punctuation, I tried
gsub("\\.", "", DF) and
gsub("\\,", "", DF) to remove the periods and commas, respectively.
This converts the data frame to a character vector.
I guess my questions are as follows:
Is there another way to remove punctuation and convert to lowercase that preserves the data frame structure?
If not, how might I be able to convert the above outputs back into the original format, i.e., a column of vectors?
I'm sure there are other ways to get this done but here's an example that works pretty well:
DF = data.frame(a = c("JJf.","K.dF","JkD.",""), b = c("ab.d","Acbd","AA.D",""))
DF2 = as.data.frame(lapply(X = DF, FUN = tolower))
DF2$a = gsub(pattern = "\\.",replacement = "", x = DF2$a)
Data frames are just special cases of lists where all the elements have the same length, so coercion back and forth isn't usually a problem.
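As a tiny illustration of that round trip (made-up data):
df  <- data.frame(x = 1:3, y = c("a", "b", "c"))
lst <- as.list(df)          # a named list of equal-length vectors
str(lst)
back <- as.data.frame(lst)  # and back to a data frame
str(back)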
From your description, it sounds like you have some data that looks like:
mydf <- data.frame(Orig = c("AbcD", "jKdf"),
                   Matched = I(list(c("ab.d", "Ac,bd", "AA.D", ""),
                                    c("JJf.", "K.dF", "JkD.", ""))))
mydf
# Orig Matched
# 1 AbcD ab.d, Ac....
# 2 jKdf JJf., K.....
str(mydf)
# 'data.frame': 2 obs. of 2 variables:
# $ Orig : Factor w/ 2 levels "AbcD","jKdf": 1 2
# $ Matched:List of 2
# ..$ : chr "ab.d" "Ac,bd" "AA.D" ""
# ..$ : chr "JJf." "K.dF" "JkD." ""
# ..- attr(*, "class")= chr "AsIs"
Usually, if you want to replace data while maintaining the same structure, you assign into the object with empty brackets ([]), like this:
mydf[] <- lapply(mydf, function(x) {
  if (is.list(x)) {
    lapply(x, function(y) {
      tolower(gsub("[.,]", "", y))
    })
  } else {
    tolower(gsub("[.,]", "", x))
  }
})
Here's the result:
mydf
# Orig Matched
# 1 abcd abd, acbd, aad,
# 2 jkdf jjf, kdf, jkd,
str(mydf)
# 'data.frame': 2 obs. of 2 variables:
# $ Orig : chr "abcd" "jkdf"
# $ Matched:List of 2
# ..$ : chr "abd" "acbd" "aad" ""
# ..$ : chr "jjf" "kdf" "jkd" ""
I have a character vector of classes that I would like to apply to a dataframe, so as to convert the current class of each field in that dataframe to the corresponding entry in the vector. For example:
frame <- data.frame(A = c(2:5), B = c(3:6))
classes <- c("character", "factor")
With a for-loop, I know that this can be accomplished using lapply. For example:
for(i in 1:2) { frame[i] <- lapply(frame[i], paste("as", classes[i], sep = ".")) }
For my purposes, however, a for-loop cannot work. Is there another solution that I am missing?
Thank you in advance for your input!
Note: I have been informed that this might be a duplicate of this post, and, yes, my question is similar to it. But I have looked at the class() approach before, and it does not seem to deal effectively with converting fields to factors. The lapply approach, on the other hand, handles it well. Unfortunately, I cannot utilize a for-loop in this instance.
If you're not averse to using lapply without a for loop, you can try something like the following.
frame[] <- lapply(seq_along(frame), function(x) {
  FUN <- paste("as", classes[x], sep = ".")
  match.fun(FUN)(frame[[x]])
})
str(frame)
# 'data.frame': 4 obs. of 2 variables:
# $ A: chr "2" "3" "4" "5"
# $ B: Factor w/ 4 levels "3","4","5","6": 1 2 3 4
However, a better option is to try to apply the correct classes when you're reading the data in to begin with.
x <- tempfile() # Just to pretend....
write.csv(frame2, x, row.names = FALSE) # ... that we are reading a csv
frame3 <- read.csv(x, colClasses = classes)
str(frame3)
# 'data.frame': 4 obs. of 2 variables:
# $ A: chr "2" "3" "4" "5"
# $ B: Factor w/ 4 levels "3","4","5","6": 1 2 3 4
Sample data:
frame <- frame2 <- data.frame(A = c(2:5), B = c(3:6))
classes <- c("character", "factor")
I have a number of data files that I am reading into R as CSVs. I would like to specify the colClasses of certain columns in these files, but the number of columns in each data frame is unknown, as the files contain species abundance data (hence, different numbers of species).
Is there a way that I can set, say, every column after the first 10 to numeric (so, ncol[10]:length(df)) using colClasses in read.csv?
This is what I tried, but to no avail:
df <- read.csv("file.csv", header=T, colClasses=c(ncols[10], rep("numeric", ncols)))
Any help would be greatly appreciated.
Thanks,
Paul
I would start with using count.fields to determine how many columns there are in the data. You can do this just on the first line.
Then, from there, you can use rep for your colClasses.
It's fugly, but works. Here's an example:
The first few lines are just to create a dummy csv file in your workspace since you didn't provide a reproducible example.
X <- tempfile()
cat("A,B,C,D,E,F",
    "1,2,3,4,5,6",
    "6,5,4,3,2,1", sep = "\n", file = X)
This is where the actual answer starts. Replace X with your actual file name in both places (the commented-out lines below show the pattern). The -2 is because two columns are already accounted for.
Y <- read.csv(X, colClasses = c(
  "numeric", "numeric",
  rep("character",
      count.fields(textConnection(readLines(X, n = 1)), sep = ",") - 2)))
# Y <- read.csv("file.csv", colClasses = c(
#   "numeric", "numeric",
#   rep("character",
#       count.fields(textConnection(readLines("file.csv", n = 1)), sep = ",") - 2)))
str(Y)
# 'data.frame': 2 obs. of 6 variables:
# $ A: num 1 6
# $ B: num 2 5
# $ C: chr "3" "4"
# $ D: chr "4" "3"
# $ E: chr "5" "2"
# $ F: chr "6" "1"
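To adapt this to the setup in the question (leave the first ten columns at their default classes and force everything after them to numeric), the same idea might look like the following untested sketch, with "file.csv" standing in for the real file:
n_fields <- count.fields("file.csv", sep = ",")[1]    # number of columns on the first line
df <- read.csv("file.csv", header = TRUE,
               colClasses = c(rep(NA, 10),             # NA = let read.csv decide
                              rep("numeric", n_fields - 10)))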