I am trying to convert a character column from a data frame to the numerics. However, what I am getting as a result are rounded up values.
Whatever I have tried by researching other questions of the same nature on SO, hasn't worked for me. I have checked the class of the column vector I am trying to convert, and it is a character, not a factor.
Here is my code snippet:
some_data <- read.csv("file.csv", nrows = 100, colClasses = c("factor", "factor", "character", "character"))
y <- Vectorize(function(x) gsub("[^\\.\\d]", "", x, perl = TRUE))
some_data$colC <- y(data1$colC)
data1$colD <- y(data1$colCD)
data1$colC <- as.numeric(data1$colC)
data1$colD <- as.numeric(data1$colD)
Edit:
> dput(head(data1))
structure(list(colA = structure(c(2L, 2L, 5L, 6L, 5L, 6L), .Label = c("(Other)",
"Direct", "Display", "Email", "Organic Search", "Paid Search",
"Referral", "Social Network"), class = "factor"), colB = structure(c(1L,
2L, 2L, 2L, 1L, 1L), .Label = c("No", "Yes"), class = "factor"),
colC = c("4023107.87", "3180863.42", "2558777.81", "2393736.25",
"1333148.48", "1275627.13"), colD = c("49731596.35", "33604210.26",
"20807573.12", "20061467.30", "10488358.77", "10442249.09"
)), .Names = c("colA", "colB", "colC", "colD"), row.names = c(NA,
6L), class = "data.frame")
I think this is a representation problem, not an actual rounding problem ...
options("digits") ## 7
From ?options:
‘digits’: controls the number of digits to print when printing numeric values. It is a suggestion only. Valid values are
1...22 with default 7. See the note in ‘print.default’ about
values greater than 15.
digits can be reset either on a one-off basis, i.e. print(object,digits=...), or globally, i.e. options(digits=20) (20 is probably overkill but helps you see what's happening: based on the results below, 10 might serve your needs well.)
as.numeric(data1$colC)
[1] 4023108 3180863 2558778 2393736 1333148 1275627
print(as.numeric(data1$colC),digits=10)
[1] 4023107.87 3180863.42 2558777.81 2393736.25 1333148.48 1275627.13
print(as.numeric(data1$colC),digits=20)
[1] 4023107.8700000001118 3180863.4199999999255 2558777.8100000000559
[4] 2393736.2500000000000 1333148.4799999999814 1275627.1299999998882
Related
I am using the R package DT to create a table. This table contains hyperlinks and the issue that I am having is that when I put rownames = FALSE to remove the row numbers, the formatting on the hyperlinks goes away. I was wondering if anyone had a solution to this problem?
Example data:
structure(list(school = structure(c(2L, 3L, 1L, 4L), .Label = c("Linfield",
"OSU", "UO", "Willamette"), class = "factor"), mascot = structure(c(2L,
3L, 4L, 1L), .Label = c("bearcats", "beavers", "ducks", "wildcats"
), class = "factor"), website = structure(c(1L, 3L, 2L, 4L), .Label = c("oregonstate.edu",
"linfield.edu", "uoregon.edu",
"willamette.edu"), class = "factor"),
School_colors = structure(c(2L, 1L, 3L, 4L), .Label = c("<span style=\"color:green\">green & yellow</span>",
"<span style=\"color:orange\">orange & black</span>", "<span style=\"color:purple\">purple and red</span>",
"<span style=\"color:red\">red and yellow</span>"), class = "factor")), class = "data.frame", row.names = c(NA,
-4L))
Code used to generate table WITH row names
datatable(df,escape = c(1,2,3))
Code used to generate table WITHOUT row names
datatable(df, rownames = FALSE,escape = c(1,2,3))
As you can see, with the second example code, the formatting in the third column is no longer there. What I want to do is create a table without row numbers but also keep the formatting of the hyperlinks
Since your deleted the rownames the indexes of your columns changed as well.
Therefore you should change your escape argument to only 1 and 2.
datatable(df, rownames = FALSE,escape = c(1,2))
A simple escape = FALSE also works for you.
datatable(df, rownames = FALSE, escape = FALSE)
We have a survey that asks for 'select all that apply' so the result is a string inside quotes with the values separated by commas. i.e. "red, black,green"
There are other question about income so I have a factor with 'low, medium, high'
I want to be able to answer questions: What percent selected 'Red', then group that by income.
I can split the string with
'''df4 <- c("black,silver,green")'''
I can create a data frame with a timestamp and the split string with
'''t2 <- as.data.frame(c(df2[2],l2))'''
I am not able to understand how to do this for all rows at one time.
Here is a DPUT of the input:
structure(list(RespData = structure(1:2, .Label = c("1/20/2020",
"1/21/2020"), class = "factor"), CarColor = c("red,blue,green,yellow",
"black,silver,green")), row.names = c(NA, -2L), class = "data.frame")
and here is a DPUT of the desired output:
structure(list(RespData = structure(c(1L, 1L, 1L, 1L, 2L, 2L,
2L), .Label = c("1/20/2020", "1/21/2020"), class = "factor"),
Cars = structure(c(3L, 1L, 2L, 4L, 5L, 6L, 2L), .Label = c("blue",
"green", "red", "yellow", "black", "silver"), class = "factor")), row.names = c(NA,
-7L), class = "data.frame")
Example of Function:
MySplitFunc <- function(ListIn) {
# build an empty data frame and set the column names
x1.all <- ListIn[0,]
names(x1.all) <- c("ResponseTime", "Descriptive")
# for each row build the data and combine to growing list
for(x in 1:nrow(ListIn)) {
#print(x)
r1 <- ListIn[x,1]
c1 <- strsplit(ListIn[x,2],",")
x1 <- as.data.frame(c(r1,c1))
# set the names and combine to all
names(x1) <- c("ResponseTime", "Descriptive")
x1.all <- rbind(x1.all,x1)
}
# strip the whitespace
x1.all <- data.frame(lapply(x1.all, trimws), stringsAsFactors = TRUE)
return(x1.all)
}
I want to make a function that calculates some pre-determined summary statistic measures that I can apply to any dataset. I'll start off with an example here, but this is for datasets that could have a variety of datatypes - such as character, factor, numerical, dates, containing null values, etc.
I can do this easy enough if the data is all numeric - but handling the IF scenarios w/ apply, sapply, etc is where I run into trouble with the syntax.
When its all numeric I'm great since I can just do new_df = data.frame(min = sapply(mydf, 2,min).....etc....etc). I just can't get the syntax right when its more complicated like in my example below.
In the example below I have a data frame of 3 columns:
all numerical
numerical with a null
categorical column of data coded as a factor
I want to calculate the:
type...(character, factor, date, numeric, etc)
mean...when the data-type is numeric obviously , and excluding nulls
number of null values in the dataset
I think this is simple enough and I can run with it from here..
copy and paste this code and name as a variable for the data frame:
structure(list(allnumeric = c(10, 20, 30, 40), char_or_factor = structure(c(2L,
3L, 3L, 1L), .Label = c("bird", "cat", "dog"), class = "factor"),
num_with_null = c(10, 100, NA, NA)), .Names = c("allnumeric",
"char_or_factor", "num_with_null"), row.names = c(NA, -4L), class = "data.frame")
expected solution data frame (copy and assign to a variable):
structure(list(allnumeric = structure(c(3L, 2L, 1L), .Label = c("0",
"25", "numeric"), class = "factor"), char_or_factor = structure(c(2L,
NA, 1L), .Label = c("0", "character"), class = "factor"), num_with_null = structure(c(3L,
2L, 1L), .Label = c("2", "55", "numeric"), class = "factor")), .Names = c("allnumeric",
"char_or_factor", "num_with_null"), row.names = c("type", "mean",
"num_nulls"), class = "data.frame")
We can use sapply to loop over the columns, get the class, mean and number of NA elements, concatenate (c() and convert to data.frame
as.data.frame(sapply(df1, function(x) c(class(x), mean(x, na.rm=TRUE),
sum(is.na(x)))), stringsAsFactors=FALSE)
I've got a data set that looks like this:
date, location, value, tally, score
2016-06-30T09:30Z, home, foo, 1,
2016-06-30T12:30Z, work, foo, 2,
2016-06-30T19:30Z, home, bar, , 5
I need to aggregate these rows together, to obtain a result such as:
date, location, value, tally, score
2016-06-30, [home, work], [foor, bar], 3, 5
There are several challenges for me:
The resulting row (a daily aggregate) must include the rows for this day (2016-06-30 in my above example
Some rows (strings) will result in an array containing all the values present on this day
Some others (ints) will result in a sum
I've had a look at dplyr, and if possible I'd like to do this in R.
Thanks for your help!
Edit:
Here's a dput of the data
structure(list(date = structure(1:3, .Label = c("2016-06-30T09:30Z",
"2016-06-30T12:30Z", "2016-06-30T19:30Z"), class = "factor"),
location = structure(c(1L, 2L, 1L), .Label = c("home", "work"
), class = "factor"), value = structure(c(2L, 2L, 1L), .Label = c("bar",
"foo"), class = "factor"), tally = c(1L, 2L, NA), score = c(NA,
NA, 5L)), .Names = c("date", "location", "value", "tally",
"score"), class = "data.frame", row.names = c(NA, -3L))
mydat<-structure(list(date = structure(1:3, .Label = c("2016-06-30T09:30Z",
"2016-06-30T12:30Z", "2016-06-30T19:30Z"), class = "factor"),
location = structure(c(1L, 2L, 1L), .Label = c("home", "work"
), class = "factor"), value = structure(c(2L, 2L, 1L), .Label = c("bar",
"foo"), class = "factor"), tally = c(1L, 2L, NA), score = c(NA,
NA, 5L)), .Names = c("date", "location", "value", "tally",
"score"), class = "data.frame", row.names = c(NA, -3L))
mydat$date <- as.Date(mydat$date)
require(data.table)
mydat.dt <- data.table(mydat)
mydat.dt <- mydat.dt[, lapply(.SD, paste0, collapse=" "), by = date]
cbind(mydat.dt, aggregate(mydat[,c("tally", "score")], by=list(mydat$date), FUN = sum, na.rm=T)[2:3])
which gives you:
date location value tally score
1: 2016-06-30 home work home foo foo bar 3 5
Note that if you wanted to you could probably do it all in one step in the reshaping of the data.table but I found this to be a quicker and easier way for me to achieve the same thing in 2 steps.
In R am reading a file with comments as csv using
read.data.raw = read.csv(inputfile, sep='\t', header=F, comment.char='')
The file looks like this:
#comment line 1
data 1<tab>x<tab>y
#comment line 2
data 2<tab>x<tab>y
data 3<tab>x<tab>y
Now I extract the uncommented lines using
comment_ind = grep( '^#.*', read.data.raw[[1]])
read.data = read.data.raw[-comment_ind,]
Which leaves me:
data 1<tab>x<tab>y
data 2<tab>x<tab>y
data 3<tab>x<tab>y
I am modifying this data through some separate script which maintains the number of rows/cols and would like to put it back into the original read data (with the user comments) and return it to the user like this
#comment line 1
modified data 1<tab>x<tab>y
#comment line 2
modified data 2<tab>x<tab>y
modified data 3<tab>x<tab>y
Since the data I extracted in read.data preserves the row names row.names(read.data), I tried
original.read.data[as.numeric(row.names(read.data)),] = read.data
But that didn't work, and I got a bunch of NA/s
Any ideas?
Does this do what you want?
read.data.raw <- structure(list(V1 = structure(c(1L, 3L, 2L, 4L, 5L),
.Label = c("#comment line 1", "#comment line 2", "data 1", "data 2",
"data 3"), class = "factor"), V2 = structure(c(1L, 2L, 1L, 2L, 2L),
.Label = c("", "x"), class = "factor"), V3 = structure(c(1L, 2L, 1L,
2L, 2L), .Label = c("", "y"), class = "factor")), .Names = c("V1",
"V2", "V3"), class = "data.frame", row.names = c(NA, -5L))
comment_ind = grep( '^#.*', read.data.raw[[1]])
read.data <- read.data.raw[-comment_ind,]
# modify V1
read.data$V1 <- gsub("data", "DATA", read.data$V1)
# rbind() and then order() comments into original places
new.data <- rbind(read.data.raw[comment_ind,], read.data)
new.data <- new.data[order(as.numeric(rownames(new.data))),]