Change decimal digits for data frame column in R - r

Questions about displaying of certain numbers of digits have been posted, however, just for single values or vectors, so I hope someone can help me with this.
I have a data frame with several columns and want to display all values in one column with two decimal digits (this column only). I have tried round() and format() and options(digits) but none worked on a column (numerical). I wonder if there is a method to do this without going the extra way of converting the column to a vector and gluing all together again.
Thanks a lot!

Here's an example of how to do this with the cars data.frame that comes installed with R.
First I'll add some variability so that we have numbers with decimal places:
data=cars+runif(nrow(cars))
Then to round just a single column (in this case the dist column to 2 decimal places):
data[,'dist']=round(data[,'dist'],2)
If your data contain whole numbers then you can guarantee that all values will have 2 decimal places by using:
cars[,'dist']=format(round(cars[,'dist'],2),nsmall=2)

Related

Convert column with dates (as strings) into date type with only the year

I have a dataset (call it df) that has several columns. One of those columns is the column date, which has strings of the form "d-MON-yy" or "dd-MON-yy" depending on if the day number is less than 10 (e.g. 9-Jan-04, 15-Oct-98) or NA.
I am trying to change this to date type values, but I only need the year. Specifically, all the dates whose yy digits are less than 20 are from this century, and all the dates whose yy digits are greater than or equal to 20are from the 1900s. I want to have the four numbers of the year in the end.
Since I am only interested in the year, I don't mind a solution that returns numeric values.
In the end, I'd like to also filter out the rows that have NA on *the date variable only.
I am pretty new to R, and I have tried to make it work with several answers I found here to no avail.
Thank you.

Reading non-rectangular data in R

I have a fairly large data set in csv format that I'd like to read into R. The data is annoyingly structured (my own fault) as follows:
,US912828LJ77,,US912810ED64,,US912828D804,...
17/08/2009,101.328125,15/08/1989,99.6171875,02/09/2014,99.7265625,...
And with the second line style repeated for a few thousand times. The structure is that each pair of columns represents a timeseries of differing lengths (so that the data is not rectangular).
If I use something like
>rawdata <- read.csv("filename.csv")
I get a dataframe with all the blank entries padded with NA, and the odd columns forced to a factor datatype.
What I'd like to ultimately get to is either a set of timeseries objects (for each pair of columns) named after every even entry in the first row (the "US912828LJ77" fields) or a single dataframe with row labels as dates running from the minimum of (min of each odd column) to max of (max of each odd column).
I can't imagine I'm the only mook to put together a dataset in such an unhelpful structure but I can't see any suggestions out there for how to deal with this. Any help would be greatly appreciated!
First you need to parse every odd column to date
odd.cols = names(rawdata)[seq(1,dim(rawdata)[2]-1,2)]
for(dateCol in odd.cols){
rawdata[[dateCol]] = as.Date(rawdata[[dateCol]], "%d/%m/%Y")
}
Now I guess the problem is straightforward, you just need to find min, max values per column, create a vector running from min date to max date, join it with rawdata and handle missing values for you US* columns.

missing values for each participant in the study

I am working in r, what I want to di is make a table or a graph that represents for each participant their missing values. i.e. I have 4700+ participants and for each questions there are between 20 -40 missings. I would like to represent the missing in such a way that I can see who are the people that did not answer the questions and possible look if there is a pattern in the missing values. I have done the following:
Count of complete cases in a data frame named 'data'
sum(complete.cases(mydata))
Count of incomplete cases
sum(!complete.cases(mydata$Variable1))
Which cases (row numbers) are incomplete?
which(!complete.cases(mydata$Variable1))
I then got a list of numbers (That I am not quite sure how to interpret,at first I thought these were the patient numbers but then I noticed that this is not the case.)
I also tried making subsets with only the missings, but then I litterly only see how many missings there are but not who the missings are from.
Could somebody help me? Thanks!
Zas
If there is a column that can distinguish a row in the data.frame mydata say patient numbers patient_no, then you can easily find out the patient numbers of missing people by:
> mydata <- data.frame(patient_no = 1:5, variable1 = c(NA,NA,1,2,3))
> mydata[!complete.cases(mydata$variable1),'patient_no']
[1] 1 2
If you want to consider the pattern in which the users have missed a particular question, then this might be useful for you:
Assumption: Except Column 1, all other columns represent the columns related to questions.
> lapply(mydata[,-1],function(x){mydata[!complete.cases(x),'patient_no']})
Remember that R automatically attach numbers to the observations in your data set. For example if your data has 20 observations (20 rows), R attaches numbers from 1 to 20, which is actually not part of your original data. They are the row numbers. The results produced by the R code: which(!complete.cases(mydata$Variable1)) correspond to those numbers. The numbers are the rows of your data set that has at least one missing data (column).

extract columns that don't have a header or name in R

I need to extract the columns from a dataset without header names.
I have a ~10000 x 3 data set and I need to plot the first column against the second two.
I know how to do it when the columns have names ~ plot(data$V1, data$V2) but in this case they do not. How do I access each column individually when they do not have names?
Thanks
Why not give them sensible names?
names(data)=c("This","That","Other")
plot(data$This,data$That)
That's a better solution than using the column number, since names are meaningful and if your data changes to have a different number of columns your code may break in several places. Give your data the correct names and as long as you always refer to data$This then your code will work.
I usually select columns by their position in the matrix/data frame.
e.g.
dataset[,4] to select the 4th column.
The 1st number in brackets refers to rows, the second to columns. Here, I didn't use a "1st number" so all rows of column 4 are selected, i.e., the whole column.
This is easy to remember since it stems from matrix calculations. E.g., a 4x3 dimensional matrix has 4 rows and 3 columns. Thus when I want to select the 1st row of the third column, I could do something like matrix[1,3]

Cluster analysis on two columns that contain name of person in R

I am a beginner in R. I have to do cluster analysis in data that contains two columns with name of persons. I converted it in data frame but it is character type. To use dist() function the data frame must be numeric. example of my data:
Interviewed.Type interviewed.Relation.Type
1. An1 Xuan
2. An2 The
3. An3 Ngoc
4. Bui Thi
5. ANT feed
7. Bach Thi
8. Gian1 Thi
9. Lan5 Thi
.
.
.
1100. Xung Van
I will be grateful for your help.
You can convert a character vector to a factor using factor. A factor is basically a vector of numbers together with an attribute giving the text associated with each number, which are called levels in R. One can use as.numeric or unclass to get at the raw numbers. These can then be fed into algorithms which require numbers, like e.g. dist.
Note that the order in which numbers are associated with texts is pretty much arbitrary (in fact alphabetical), so the difference between numbers has no meaning in most applications. Therefore calling dist on this result is technically possible, but not neccessarily meaningful. For this reason, the author of this answer is not satisfied with it, even if the original poster seems to be happy about it. :-)
Also note that if there are different vectors, converting each separately will mean that the same number will represent different textual values and vice versa, unless both vectors are compromised from exactly the same set of distinct values. Additional care has to be taken if you want the same levels for both factors. One way would be to concatenate both vecotrs, turn that into a factor, and then split the result into two factor vectors.

Resources