I have a fairly large data set in csv format that I'd like to read into R. The data is annoyingly structured (my own fault) as follows:
,US912828LJ77,,US912810ED64,,US912828D804,...
17/08/2009,101.328125,15/08/1989,99.6171875,02/09/2014,99.7265625,...
The second line's format is repeated a few thousand times. Each pair of columns represents a time series, and the series have differing lengths (so the data is not rectangular).
If I use something like
rawdata <- read.csv("filename.csv")
I get a dataframe with all the blank entries padded with NA, and the odd columns forced to a factor datatype.
What I'd like to ultimately get to is either a set of timeseries objects (for each pair of columns) named after every even entry in the first row (the "US912828LJ77" fields) or a single dataframe with row labels as dates running from the minimum of (min of each odd column) to max of (max of each odd column).
I can't imagine I'm the only mook to put together a dataset in such an unhelpful structure but I can't see any suggestions out there for how to deal with this. Any help would be greatly appreciated!
First, you need to parse every odd column as a date:
odd.cols = names(rawdata)[seq(1,dim(rawdata)[2]-1,2)]
for(dateCol in odd.cols){
rawdata[[dateCol]] = as.Date(rawdata[[dateCol]], "%d/%m/%Y")
}
Now I guess the problem is straightforward: find the min and max dates across the date columns, create a vector running from the min date to the max date, join it with rawdata, and handle the missing values in your US* columns.
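The steps above can be sketched roughly as follows. This is only one possible approach, not necessarily what the answerer had in mind: the small rawdata frame is a made-up stand-in for the output of read.csv (its values and the X / X.1 column names are assumptions), and merge() is used for the join.

```r
# Hypothetical stand-in for read.csv("filename.csv"): two series of
# different lengths, dates in the odd columns, values in the even ones.
rawdata <- data.frame(
  X            = c("17/08/2009", "18/08/2009", "20/08/2009"),
  US912828LJ77 = c(101.3, 101.5, 101.4),
  X.1          = c("18/08/2009", "19/08/2009", ""),
  US912810ED64 = c(99.6, 99.7, NA),
  stringsAsFactors = FALSE
)

# Positions of the date columns (1, 3, 5, ...)
date.cols <- seq(1, ncol(rawdata) - 1, by = 2)

# Build one (date, value) data frame per pair, dropping the padded rows
series <- lapply(date.cols, function(i) {
  d <- as.Date(rawdata[[i]], "%d/%m/%Y")   # blank entries become NA
  keep <- !is.na(d)
  out <- data.frame(date = d[keep], value = rawdata[[i + 1]][keep])
  names(out)[2] <- names(rawdata)[i + 1]   # name the series after its ID
  out
})

# Daily calendar from the overall min date to the overall max date
all.dates <- do.call(c, lapply(series, `[[`, "date"))
result <- data.frame(date = seq(min(all.dates), max(all.dates), by = "day"))

# Left-join every series onto the calendar; missing days stay NA
for (s in series) result <- merge(result, s, by = "date", all.x = TRUE)
print(result)
```

Each element of series could also be kept on its own if separate time series objects are preferred over one wide data frame.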
I have the following data frame in R:
df <- data.frame(time=c("10:01","10:05","10:11","10:21"),
power=c(30,32,35,36))
Problem: I want to calculate the energy consumption, so I need the sum of the time differences multiplied by the power. But every row has one timestamp, meaning I need to do subtraction between two different rows. And that is the part I cannot figure out. I guess I would need some kind of function but I couldn't find online hints.
Example: It has to subtract row1$time from row2$time, and then multiply the result by row1$power.
As said, I do not know how to implement the step in one call, I am confused about the subtraction part since it takes values from different rows.
Expected output: E=662
Try this:
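# Parse the clock times (strptime fills in the current date)
tmp = strptime(df$time, format="%H:%M")
# Minutes until the next row; the last row has no successor, hence NA
df$interval = c(as.numeric(diff(tmp), units="mins"), NA)
sum(df$interval*df$power, na.rm=TRUE)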
I got 662 back.
I am trying to compare two Excel files (same number of columns, but sometimes different number of rows).
I imported the Excel files to data1 and data2 respectively.
library(dataCompareR)
comparedata <- rCompare(data1, data2)
summary(comparedata)
saveReport(comparedata, reportName = 'Comparison Result')
All goes well, but I have three challenges:
The sample row data is limited to 5 rows. How can I increase that to the actual number of differing rows that the summary reports?
How can I show the primary key in the result? It only shows the two matching columns.
Sometimes the number of rows doesn't match, and the comparison gets misaligned. Can I set up a primary comparison key instead of comparing row to row?
I would like to create a new column in my dataframe that assigns a categorical value based on a condition to the other observations.
In detail, I have a column that contains timestamps for all observations. The rows are sorted in ascending order by timestamp.
Now, I'd like to calculate the difference between each consecutive timestamp and if it exceeds a certain threshold the factor should be increased by 1 (see Desired Output).
Desired Output
I tried to solve it with a for loop, but that takes a lot of time because the dataset is huge.
After searching for a bit I found this approach and tried to adapt it: R - How can I check if a value in a row is different from the value in the previous row?
ind <- with(df, c(TRUE, timestamp[-1L] > (timestamp[-length(timestamp)]-7200)))
However, I can not make it work for my dataset.
Thanks for your help!
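The rule described above (start a new category whenever the gap to the previous timestamp exceeds a threshold) can be sketched without a loop using diff() and cumsum(). This is a minimal sketch under assumptions: the timestamps are taken to be numeric seconds, the 7200 threshold is borrowed from the attempt above, and the example data and the group column name are made up.

```r
# Hypothetical data: numeric timestamps in seconds, sorted ascending
df <- data.frame(timestamp = c(0, 3600, 7000, 20000, 21000, 40000))

threshold <- 7200  # assumed maximum gap (in seconds) within one group

# TRUE wherever the gap to the previous row exceeds the threshold;
# the first row has no predecessor, so it never starts a "new" group
new_group <- c(FALSE, diff(df$timestamp) > threshold)

# Running count of group starts, shifted so the first group is 1
df$group <- cumsum(new_group) + 1
print(df)
```

If timestamp is a POSIXct column instead, diff() returns a difftime, so the gap would need converting first, for example with as.numeric(diff(df$timestamp), units = "secs").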
I am working in R, and what I want to do is make a table or a graph that shows the missing values for each participant. I have 4700+ participants, and for each question there are between 20 and 40 missing values. I would like to represent the missing values in such a way that I can see who did not answer the questions and possibly check whether there is a pattern in the missing values. I have done the following:
Count of complete cases in a data frame named 'mydata'
sum(complete.cases(mydata))
Count of incomplete cases
sum(!complete.cases(mydata$Variable1))
Which cases (row numbers) are incomplete?
which(!complete.cases(mydata$Variable1))
I then got a list of numbers that I am not quite sure how to interpret. At first I thought these were the patient numbers, but then I noticed that this is not the case.
I also tried making subsets with only the missing values, but then I literally only see how many missing values there are, not who they belong to.
Could somebody help me? Thanks!
Zas
If the data.frame mydata has a column that uniquely identifies each row, say a patient number column patient_no, then you can easily find the patient numbers of the people with missing values:
> mydata <- data.frame(patient_no = 1:5, variable1 = c(NA,NA,1,2,3))
> mydata[!complete.cases(mydata$variable1),'patient_no']
[1] 1 2
If you want to consider the pattern in which the users have missed a particular question, then this might be useful for you:
Assumption: Except Column 1, all other columns represent the columns related to questions.
> lapply(mydata[,-1],function(x){mydata[!complete.cases(x),'patient_no']})
Remember that R automatically attaches numbers to the observations in your data set. For example, if your data has 20 observations (20 rows), R attaches the numbers 1 to 20, which are not part of your original data. They are the row numbers. The results produced by which(!complete.cases(mydata$Variable1)) correspond to those numbers: they are the rows of your data set in which Variable1 is missing.
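To make the difference between row numbers and patient numbers concrete, and to get a per-participant view of the missing values, something like the following might help. It is only a sketch: the data, the patient_no column and the q1/q2 question columns are invented for illustration.

```r
# Hypothetical data: patient numbers deliberately differ from row numbers
mydata <- data.frame(patient_no = c(101, 102, 103, 104),
                     q1 = c(NA, 2, 3, NA),
                     q2 = c(1, NA, 3, NA))

# Row numbers where q1 is missing (positions, not patient numbers)
rows_missing <- which(is.na(mydata$q1))

# Translate those row numbers into patient numbers
patients_missing <- mydata$patient_no[rows_missing]

# Logical matrix of missingness for the question columns only
miss <- is.na(mydata[, -1])

# Number of unanswered questions per participant, labelled by patient
n_missing <- rowSums(miss)
names(n_missing) <- mydata$patient_no
print(n_missing)

# How often each missingness pattern occurs ("1" = missing, per column)
patterns <- table(apply(miss, 1, function(r) paste(as.integer(r), collapse = "")))
print(patterns)
```

Here rows_missing is 1 and 4, while patients_missing gives the corresponding identifiers 101 and 104.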
I need to extract the columns from a dataset without header names.
I have a ~10000 x 3 data set and I need to plot the first column against the second two.
I know how to do it when the columns have names ~ plot(data$V1, data$V2) but in this case they do not. How do I access each column individually when they do not have names?
Thanks
Why not give them sensible names?
names(data)=c("This","That","Other")
plot(data$This,data$That)
That's a better solution than using the column number, since names are meaningful, and code that refers to columns by position may break in several places if your data later gains or loses columns. Give your data the correct names, and as long as you always refer to data$This your code will keep working.
I usually select columns by their position in the matrix/data frame.
e.g.
dataset[,4] to select the 4th column.
The 1st number in brackets refers to rows, the second to columns. Here, I didn't use a "1st number" so all rows of column 4 are selected, i.e., the whole column.
This is easy to remember since it follows matrix notation. E.g., a 4x3 matrix has 4 rows and 3 columns. Thus, when I want to select the element in the 1st row of the 3rd column, I could do something like matrix[1,3]
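Applied to the original question, positional indexing could look like the sketch below. The matrix m is made-up example data, and using matplot() to draw both remaining columns against the first is just one option, not something the answers above prescribe.

```r
# Hypothetical 5 x 3 data set with no column names
m <- cbind(1:5, (1:5)^2, (1:5)^3)

x  <- m[, 1]     # all rows of the first column
ys <- m[, 2:3]   # all rows of the second and third columns

# Plot columns 2 and 3 against column 1 in a single call
matplot(x, ys, type = "l", xlab = "column 1", ylab = "columns 2 and 3")

# The same positions work on a data frame; [[ ]] extracts one column
dat <- as.data.frame(m)
first_col <- dat[[1]]
```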