Function in R to determine complete cases - r

I have the following code to print the complete cases:
complete <- function(directory, id=1:332) {
data<-NULL
dat <- NULL
s <- NULL
for (i in 1:length(id)) {
data[[i]]<- c(paste(directory, "/",formatC(id[i], width=3, flag=0),".csv",sep=""))
df[[i]]<-c(read.csv(data[[i]]))
s[i] <- sum(complete.cases(df[[i]]))
dat <- data.frame(cbind(id,nobs=s[i]))
}
dat
}
The output that I get is as follows:
complete("specdata", c(2, 4, 8, 10, 12))
id nobs
1 2 96
2 4 96
3 8 96
4 10 96
5 12 96
The required output looks like this:
complete("specdata", c(2, 4, 8, 10, 12))
## id nobs
## 1 2 1041
## 2 4 474
## 3 8 192
## 4 10 148
## 5 12 96
The .csv looks like this:
head(file)
Date sulfate nitrate ID
1 2003-01-01 NA NA 1
2 2003-01-02 NA NA 1
3 2003-01-03 NA NA 1
4 2003-01-04 NA NA 1
5 2003-01-05 NA NA 1
6 2003-01-06 NA NA 1
As the two outputs show, the nobs value for every id is the one that belongs to id == 12. I can't figure out how to output the nobs that corresponds to each id. (Ignore the ## at the start of each line of the required output.) Thanks in advance.

I tried to clean your code:
complete <- function(directory, id) {
  s <- vector()
  for (i in seq_along(id)) {
    # build the file name, e.g. "specdata/002.csv"
    path <- paste(directory, "/", formatC(id[i], width=3, flag="0"), ".csv", sep="")
    data <- read.csv(path)
    # count the rows that contain no NA
    s[i] <- sum(complete.cases(data))
  }
  # build the result once, after the loop has filled s
  dat <- data.frame(cbind(id,nobs=s))
  return(dat)
}
If this does not work, check what formatC and complete.cases return for your files.
EDIT:
There were several redundancies in your code as well as two logical errors.
First, you don't need to initialize objects in R before assigning to them. I deleted these two
data<-NULL
dat <- NULL
and changed the third into an empty vector. Second, I removed the indices from your data and df objects and gave them more expressive names (path and data). Because both objects are created anew in every iteration of the for loop, indexing them serves no purpose. Finally, you built dat inside the loop, before its closing brace, and filled dat$nobs from a single element of s (in the end the last one, which is recycled for every id):
dat <- data.frame(cbind(id,nobs=s[i]))
Moving that line after the loop and changing it to
dat <- data.frame(cbind(id,nobs=s))
did the trick.
Please consider reading a good beginner's book on (R) programming to gain a better understanding of control structures.
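As a further sketch (not part of the original answer), the loop can be replaced by a vectorised version. It assumes, as in the question, that the files are named 001.csv, 002.csv, ... inside directory; sprintf, file.path and vapply are all base R.

```r
complete <- function(directory, id = 1:332) {
  # Build paths such as "specdata/002.csv" for every requested id.
  paths <- file.path(directory, sprintf("%03d.csv", id))
  # Count the rows without any NA in each file.
  nobs <- vapply(
    paths,
    function(p) sum(complete.cases(read.csv(p))),
    integer(1),
    USE.NAMES = FALSE
  )
  data.frame(id = id, nobs = nobs)
}
```

Because the result is built from id directly, the rows come back in the order in which the ids were requested.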

Related

Remove missing values with user defined function

I have a dataset as data with missing values.
a <- sample(1:100,15)
b <- sample(1:20,15)
data <- data.frame(a,b)
data[c(3,6,8,12),2] <- NA
data
Now I want to delete the rows with missing values by one variable at a time. (Don't want to use na.omit() ). I have written the following function, but it's not working.
rmv_missing <- function(y,z){
z <- z[is.na(z$y) == TRUE,]
return(z)
}
rmv_missing("b",data)
Also tried this one...
library(dplyr)
na_values <- function(x,y,z){
z <- (filter(z,!is.na(y)))
return(z)
}
rmv_missing("b",data)
Neither of these functions works. Could someone help me understand where I made the mistake and how to fix the code? Thanks in advance.
First, you don't need the "== TRUE", since is.na already returns a logical vector. The other problem is that z$y does not work inside the function: $ does not evaluate its argument, so it looks for a column literally named "y" rather than the column whose name is stored in y. Index with z[, y] instead, et voilà:
rmv_missing <- function(y,z)
{
print(z$y) # Does not work
print(z[, y]) # Works
z[is.na(z[, y]),]
}
rmv_missing("b",data)
# NULL
# [1] 9 16 NA 5 13 NA 8 NA 11 17 20 NA 10 12 1
# a b
# 3 33 NA
# 6 59 NA
# 8 81 NA
# 12 26 NA
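Note that the function above returns the rows where the column is NA, mirroring the question's code. If the goal is to drop those rows, negating the test should do it. A minimal sketch, with illustrative argument names column and df and a small made-up data frame:

```r
# Keep only the rows where the chosen column is not NA.
rmv_missing <- function(column, df) {
  df[!is.na(df[[column]]), , drop = FALSE]
}

data <- data.frame(a = 1:5, b = c(10, NA, 30, NA, 50))
rmv_missing("b", data)
```

Using df[[column]] with a character name sidesteps the $ problem described above.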

How to use if else statement in a dataframe when comparing dates?

I have a dataframe D and I would want to calculate a daily return of "Close" only if they share the same month. So for example there would be 0 for 1995-08-01
Date Close Month
1 1995-07-27 163.32 1995-07
2 1995-07-28 161.36 1995-07
3 1995-07-30 162.91 1995-07
4 1995-08-01 162.95 1995-08
5 1995-08-02 162.69 1995-08
I am trying to use an if-else statement and looping to apply it on other dataframes.
D1 <- D[-1,]
for (i in c("Close"))
{ TT <- dim(D)[1]
if (D[1:(TT-1),"Month"] == D[2:TT,"Month"]) {
D1[,i] = round((100*(log(D[2:TT,i]/D[1:(TT-1),i]))), digits = 4)
}
else {
D1[i] = 0 }
}
I get the results below, but the fourth row should be 0.0000 because the fourth row is from a different month than the third row. Moreover, I get this warning message: "Warning message: In if (D[1:(TT - 1), "Month"] == D[2:TT, "Month"]) { : the condition has length > 1 and only the first element will be used". Can you please help me out? Thank you.
Date Close Month
1 1995-07-27 0.5903 1995-07
2 1995-07-28 1.4577 1995-07
3 1995-07-30 0.9139 1995-07
4 1995-08-01 0.0006 1995-08
5 1995-08-02 0.0255 1995-08
Next time you should REALLY provide a reproducible example; here I did it for you. My solution uses diff and ifelse as requested.
month <- c(1,1:5,5:6)
data <- (1:8)*(1:8)
df <- data.frame(cbind(month, data))
diffs <- sapply(df, diff)
diffs <- data.frame(rbind(NA, diffs))
df$result <- ifelse(diffs$month==0, diffs$data, 0)
df
month data result
1 1 1 NA
2 1 4 3
3 2 9 0
4 3 16 0
5 4 25 0
6 5 36 0
7 5 49 13
8 6 64 0
if() expects a single value (usually TRUE or FALSE; a single number also works, with 0 treated as FALSE and any other number as TRUE). You are feeding in a vector of values. The warning message is telling you that it is ignoring all the values of the vector except the first, which is usually a strong indication that your code is not doing what you intend it to do. (Since R 4.2.0 a condition of length greater than one is an error rather than a warning.)
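To make the difference concrete, here is a small sketch of the element-wise alternative: ifelse() takes a whole logical vector and picks a value per element. The month vector and the labels are made up for illustration.

```r
month <- c("1995-07", "1995-07", "1995-07", "1995-08", "1995-08")

# Compare each month with the one before it (one comparison per pair of rows).
same_month <- month[-1] == month[-length(month)]

# ifelse() works element by element, unlike if().
labels <- ifelse(same_month, "same month", "new month")
labels
```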
Here's one do-it-yourself approach with no loops (I'm sure some time-series package has a function to calculate returns):
# create your example dataset
D <- data.frame(
Date = (as.Date("1995-07-27") + 0:6)[-c(3,5)],
Close = 162 + c(1.32, -.64, .91, .95, .69)
)
# get lagged values as new columns
D$Close_lag <- dplyr::lag(D$Close)
D$Date_lag <- dplyr::lag(D$Date)
# calculate all returns
D$return <- D$Close / D$Close_lag - 1
# identify month switches
D$new_month <- lubridate::month(D$Date) != lubridate::month(D$Date_lag)
# replace returns with zeros when month switches
D[!is.na(D$return) & D$new_month==TRUE, "return"] <- 0
# print results
D
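The question asked for log returns in percent and a 0 at each month change. A base R sketch of that variant, without dplyr or lubridate, might look like this; the column name Return is an assumption, and the first row stays NA because it has no previous close.

```r
D <- data.frame(
  Date  = as.Date(c("1995-07-27", "1995-07-28", "1995-07-30",
                    "1995-08-01", "1995-08-02")),
  Close = c(163.32, 161.36, 162.91, 162.95, 162.69)
)

# Year-month label of every row.
month <- format(D$Date, "%Y-%m")

# Log return in percent relative to the previous row (NA for the first row).
log_ret <- c(NA, round(100 * diff(log(D$Close)), 4))

# TRUE where a row is in the same month as the previous row.
same_month <- c(NA, month[-1] == month[-length(month)])

# Keep the return within a month, use 0 when the month changes.
D$Return <- ifelse(same_month, log_ret, 0)
D
```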

If (condition), add 1 to previous value, else, subtract 1

I'm tracking Meals and satiety in a dataframe. I would like to have R add 1 to the previous value in the satiety column when a meal is eaten, and subtract 1 when no meal is eaten (meal=NA).
I'm trying to accomplish this with a for loop nested in an ifelse statement but it is not working.
My current attempt:
ifelse(Meals=="NA",for (i in 1:length(Day$Fullness)){
print(Day$Fullness[[i]]-1+i)}, for (i in 1:length(Day$Fullness)){
print(Day$Fullness[[i]]+1+i)}
Error: Error in ans[test & ok] <- rep(yes, length.out = length(ans))
[test & ok] :
replacement has length zero
In addition: Warning message:
In rep(yes, length.out = length(ans)) :
'x' is NULL so the result will be NULL
I'm not sure how to create a table on here but I will do my best to make sense.
Time: 9:30 AM 10:00 AM 10:30 AM ETC
Meals: NA NA Breakfast NA NA Snack NA NA NA ETC
Satiety: Range from 0-10.
My current satiety data is just a vector I created, but I would like it to start at 0 and increase by 1 after every meal, while decreasing by 1 after every 30 minute timeframe where there is no meal(where meal= NA).
I'm sure there is a much better way to do this.
Thank you.
Here's some sample data and a potential solution.
set.seed(123)
meals <- sample(c(1, 1, 1, NA), 20, replace = TRUE)
df <- data.frame(meals = meals)
head(df)
# meals
# 1 1
# 2 NA
# 3 1
# 4 NA
# 5 NA
# 6 1
df$meals[is.na(df$meals)] <- -1
df$satiety <- cumsum(df$meals)
head(df)
# meals satiety
# 1 1 1
# 2 -1 0
# 3 1 1
# 4 -1 0
# 5 -1 -1
# 6 1 0
tail(df)
# meals satiety
# 15 1 5
# 16 -1 4
# 17 1 5
# 18 1 6
# 19 1 7
# 20 -1 6
I would suggest not coding the absence of a meal (or a skipped meal) as NA, which means "I don't know". If you're using NA to mean the meal was skipped, then you do actually know, and you should give it a value that represents a skipped meal. Here, since your model interprets a skipped meal as having a negative impact on satiety (not a neutral impact), -1 actually makes quite a lot of sense. If that's how you use it in your model, then code it that way.
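The question also says satiety should stay between 0 and 10, which a plain cumsum does not enforce (it goes negative above). One possible sketch, assuming a start value of 0 and using base R's Reduce to carry the clamped running value forward; the meals vector is made up from the question's example:

```r
meals <- c(NA, NA, "Breakfast", NA, NA, "Snack", NA, NA, NA)

# +1 for a meal, -1 for a 30-minute slot without one.
step <- ifelse(is.na(meals), -1, 1)

# Running total that is clamped to the range 0..10 at every step.
clamp_add <- function(prev, s) min(10, max(0, prev + s))
satiety <- Reduce(clamp_add, step, accumulate = TRUE, init = 0)[-1]  # drop the start value

data.frame(meals, satiety)
```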
A couple of things here.
Unless the data includes the string "NA", you should use the command is.na(x) to check if a value or values are NA. It's hard to tell however without sample data.
Generally speaking, in R you will want to use vectorised solutions. If you find yourself writing a for loop, there is often a simpler vectorised alternative.
You've stated that "Meals" is in a dataframe. As such, you will need to refer to Meals as a subset of that data frame. For example, if the data frame is data, then the expression should be data$Meals.
Summarising all of this, I'd probably do something similar to the following:
Day$Meals.na <- is.na(Day$Meals)
print(Day$Fullness + (-1)^Day$Meals.na)
This uses a nice trick: in arithmetic, TRUE and FALSE are treated as 1 and 0 respectively, so (-1)^Day$Meals.na is -1 where the meal is NA and 1 otherwise.
Hopefully this helps. If not, we'd really need sample data and expected outputs to be able to be of more use.

Column looping through a user function and storing output in a newly created column (R)

I have some data that consists of an oscillatory-like pattern and would like to take some measurements of the peaks. I have several chunks of code and most of them work to do exactly what I want. The main issue I'm having is that I have no idea how to integrate them to work functionally together.
Essentially, I would like to use the freq function I've written on a dataframe so that it will loop through each column (a, b, and c) and give me the results of the function. Then I would like to store the output for each column in a new dataframe with the column names matching the source names.
I have read a lot of answers about looping through columns and creating new columns in a dataframe, which is how I've gotten to this point. Some of the individual pieces need a little tweaking but what I can't find anywhere is a good explanation of how I can put it all together. I have tried to no avail; I just can't see to get the order right.
(For reproducible data)
library(zoo)
count = 1:20
a = c(-0.802776, -0.748272, 0.187434, 1.23577, 1.00677, 0.874122, 0.232802, -0.279368, -1.57815, -1.76652, -0.958916, -0.316385, 0.831575, 1.19312, 1.45508, 0.848923, 0.257728, -0.318474, -1.14129, -1.42576)
b = c(-2.23512, -1.36572, -0.0357366, 0.925563, 1.53282, 0.171045, -0.438714, -1.38769, -0.696898, 1.37184, 2.01038, 2.6302, 2.53296, 1.8788, 0.100366, -1.34726, -1.4309, -1.37271, -0.750669, 0.100656)
c = c(0.749062, 0.0690315, -0.750494, -1.04069, -0.654432, 0.0186072, 0.710011, 0.920915, 1.13075, 0.227108, -0.195086, -0.68333, -0.607532, -0.485424, 0.495913, 0.655385, 0.468796, 0.274053, -0.906834 , 0.321526)
test = data.frame(count, a, b, c)
d = 20:40
This is the chunk of code I've written to go through any data I specify and identify local peaks, then calculate a series of things from the identified peaks. It works really well and there's no issue with the functionality of this (however, suggestions to make it better are welcome), just with putting it together with the rest.
I would like to loop through columns of a dataframe (using a for loop in the next section to accomplish that) and get the result of the freq function for each column
freq = function(x, y, data, w=1, span = 0.05, ...) {
require(zoo)
n = length(y)
y.smooth = loess(y ~ x, span = span)$fitted
y.max = rollapply(zoo(y.smooth), 2*w+1, max, align = "center")
delta = y.max - y.smooth[-c(1:w, n+1-1:w)]
i.max = which(delta <= 0) + w #identifies peaks
list(x = x[i.max], i = i.max, y.hat = y.smooth)
dist = diff(i.max) #calculates distance between peaks
instfreq = (25/dist) #calculates the rate of each peak occurence
print(instfreq) #output I ultimately want
}
#example
freq(count, a, span = 0.5)
This is how I'm looping through columns in a specified dataframe. Also, I'm not sure what I've done but this ends up printing my output twice...(which I'd like to avoid).
for(i in test){
output <- freq(test$count, y = i, span = 0.5)
print(output)
}
This is probably the part giving me the biggest headache. This should add new columns to an existing dataframe. It works so far but I have yet to figure out how to integrate it into the stuff above. Also, I'd really like for it to store the output in a new dataframe, rather than the source dataframe.
For reference, here df = data, to.add = data to add to df, new.name = name of new col
Another thing I'd like is for the new.name to come from the source (to.add). For example if I tried to add d (from above) to the end of test, I'd like for the column name (new.name) to read d without having to specify it. This will be helpful when I'm looping through multiple columns and want to keep the source name from which the output was calculated.
add.col = function(df, to.add, new.name){
if (nrow(df) < length(to.add)){
df = # pads rows if needed
rbind(df, matrix(NA, length(to.add)-nrow(df), ncol(df),
dimnames = list(NULL, names(df))))
}
length(to.add) = nrow(df) # pads with NA's
df[, new.name] = to.add; # names new col whatever was placed in new.name arg
return(head(df)) #shortened output so I can verify it worked
#when I was testing it for myself, this would
#need to be changed so that it adds the column
#to a dataframe and stores the results, which
#I believe would require I use print() and a store
#like Results = print(df)
}
#example
addcol(test, d, "d") #would like the code to grab the name d just from the to.add
#argument, without having to specify "d" as the new.name
Any help, suggestions, or refinements (to make it less clunky, more efficient, etc) would be greatly appreciated.
I can get by with the for loop (if the duplications get fixed) as long as I can figure out how to store all the output together in one place. My actual data is in a similar format to the reproducible set above, it just has far more rows and columns (and will already be in a .csv dataframe rather than creating it from individual vectors).
I've been beating my head over this for a few days now and have gotten so far but just can't get it all the way.
Also, feel free to edit the title to help it get to the right people!
OK, first of all, your output is printed twice because of what happens on each pass through the loop:
instfreq gets calculated
instfreq gets printed by the print() call inside freq, which also returns it
the returned value is assigned to output
output gets printed again by print(output) in the loop
Furthermore, I suppose you don't want your function to run on the count column itself (which returns numeric(0)), so it is best to run it only on the other columns.
Lastly, such simple for loops can easily be replaced by the apply function in R, which brings the first part of your question to:
freq = function(x, y, data, w=1, span = 0.05, ...) {
require(zoo)
n = length(y)
y.smooth = loess(y ~ x, span = span)$fitted
y.max = rollapply(zoo(y.smooth), 2*w+1, max, align = "center")
delta = y.max - y.smooth[-c(1:w, n+1-1:w)]
i.max = which(delta <= 0) + w #identifies peaks
list(x = x[i.max], i = i.max, y.hat = y.smooth)
dist = diff(i.max) #calculates distance between peaks
instfreq = (25/dist) #calculates the rate of each peak occurence
return(instfreq) #output I ultimately want
}
output <- apply(test[,2:length(test[1,])],2, function(v) freq(test$count, y=v, span=0.5))
output
# a b c
#2.500000 3.571429 2.777778
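apply() returns a plain named vector here because every column happens to yield a single value. If the columns can yield different numbers of peaks, a list keeps the results together under the source column names. A rough sketch of that pattern with lapply(); peak_gaps is a hypothetical stand-in for freq (it needs no zoo or loess), and the small data frame is made up:

```r
# Stand-in for freq(): distances between strict local maxima of y.
peak_gaps <- function(y) {
  peaks <- which(diff(sign(diff(y))) == -2) + 1
  diff(peaks)
}

demo <- data.frame(
  count = 1:8,
  a = c(0, 1, 0, 1, 0, 1, 0, 1),
  b = c(0, 1, 0, 0, 0, 1, 0, 0)
)

# Apply the function to every column except count; names are kept.
results <- lapply(demo[-1], peak_gaps)
results
```

Each element of results is named after the column it came from, so results$a and results$b can have different lengths.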
The second part of your question asks how to use the name of a variable as the name of the new column. For this we can use deparse(substitute(variable)), so your function becomes:
add.col = function(df, to.add){
new.name <- deparse(substitute(to.add))
if (nrow(df) < length(to.add)){
df = # pads rows if needed
rbind(df, matrix(NA, length(to.add)-nrow(df), ncol(df),
dimnames = list(NULL, names(df))))
}
length(to.add) = nrow(df) # pads with NA's
df[, new.name] = to.add; # names new col whatever was placed in new.name arg
return(df)
}
#example
dnametest = 20:40
add.col(test, dnametest)
# count a b c dnametest
#1 1 -0.802776 -2.2351200 0.7490620 20
#2 2 -0.748272 -1.3657200 0.0690315 21
#etc.
This function will not overwrite your original dataframe, so you simply need to assign the result to a new dataframe:
newframe <- add.col(test, dnametest)
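In isolation, the deparse(substitute()) idiom used above works like this; name_of is a hypothetical helper written only for this illustration:

```r
# Return the expression that was passed as the argument, as a string.
name_of <- function(x) deparse(substitute(x))

d <- 20:40
name_of(d)
# [1] "d"
```

substitute() captures the unevaluated argument and deparse() turns it into text, which is why the caller does not have to pass the name separately.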
EDIT: adding a way to loop over any number of vectors:
The first problem you'll have while trying to loop is that you're working with vectors of different lengths. This makes it hard to work with dataframes, so you'll have to work with lists. In this case it is easier to write a new function which takes any number of vectors and loops over them for you automatically. Because it's easier to catch and add the names in this function, I've readjusted your function add.col to take new.name again:
add.col = function(df, to.add, new.name){
if (nrow(df) < length(to.add)){
df = # pads rows if needed
rbind(df, matrix(NA, length(to.add)-nrow(df), ncol(df),
dimnames = list(NULL, names(df))))
}
length(to.add) = nrow(df) # pads with NA's
df[, new.name] = to.add;
return((df))
}
Then I can write a second function, add.multicol, like this:
#this function takes in an unspecified number of arguments
add.multicol <- function(df, ...){
#convert this number of arguments to a list
to.add.cols <- list(...)
#add the variable names to this list
names(to.add.cols) <- as.list(substitute(list(...)))[-1]
#find number of columns to add
number.cols.to.add <- length(to.add.cols)
#loop add.col
newframe <- df
for(i in 1:number.cols.to.add){
to.add.col <- array(unlist(to.add.cols[i]))
to.add.col.name <- names(to.add.cols[i])
newframe <- add.col(newframe,to.add.col,to.add.col.name)
}
return(newframe)
}
This will allow you to do whichever you want. Example:
dnametest <- 20:40
test1 <- 1:15
test2 <- 25:56
argumentsake <- seq(0,1,length=21)
#run function
newframe <- add.multicol(test,dnametest,test1,test2,argumentsake)
newframe
# count a b c dnametest test1 test2 argumentsake
#1 1 -0.802776 -2.2351200 0.7490620 20 1 25 0.00
#2 2 -0.748272 -1.3657200 0.0690315 21 2 26 0.05
#3 3 0.187434 -0.0357366 -0.7504940 22 3 27 0.10
#4 4 1.235770 0.9255630 -1.0406900 23 4 28 0.15
#5 5 1.006770 1.5328200 -0.6544320 24 5 29 0.20
#6 6 0.874122 0.1710450 0.0186072 25 6 30 0.25
#7 7 0.232802 -0.4387140 0.7100110 26 7 31 0.30
#8 8 -0.279368 -1.3876900 0.9209150 27 8 32 0.35
#9 9 -1.578150 -0.6968980 1.1307500 28 9 33 0.40
#10 10 -1.766520 1.3718400 0.2271080 29 10 34 0.45
#11 11 -0.958916 2.0103800 -0.1950860 30 11 35 0.50
#12 12 -0.316385 2.6302000 -0.6833300 31 12 36 0.55
#13 13 0.831575 2.5329600 -0.6075320 32 13 37 0.60
#14 14 1.193120 1.8788000 -0.4854240 33 14 38 0.65
#15 15 1.455080 0.1003660 0.4959130 34 15 39 0.70
#16 16 0.848923 -1.3472600 0.6553850 35 NA 40 0.75
#17 17 0.257728 -1.4309000 0.4687960 36 NA 41 0.80
#18 18 -0.318474 -1.3727100 0.2740530 37 NA 42 0.85
#19 19 -1.141290 -0.7506690 -0.9068340 38 NA 43 0.90
#20 20 -1.425760 0.1006560 0.3215260 39 NA 44 0.95
#21 NA NA NA NA 40 NA 45 1.00
#22 NA NA NA NA NA NA 46 NA
#23 NA NA NA NA NA NA 47 NA
#24 NA NA NA NA NA NA 48 NA
#25 NA NA NA NA NA NA 49 NA
#26 NA NA NA NA NA NA 50 NA
#27 NA NA NA NA NA NA 51 NA
#28 NA NA NA NA NA NA 52 NA
#29 NA NA NA NA NA NA 53 NA
#30 NA NA NA NA NA NA 54 NA
#31 NA NA NA NA NA NA 55 NA
#32 NA NA NA NA NA NA 56 NA
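The padding step at the heart of add.col can also be written compactly. A minimal sketch, assuming all inputs are passed as named vectors; pad_bind is a hypothetical helper, not part of the answer above:

```r
# Pad every vector with NA to the length of the longest one,
# then bind them into a data frame.
pad_bind <- function(...) {
  cols <- list(...)
  n <- max(lengths(cols))
  padded <- lapply(cols, function(v) {
    length(v) <- n  # extending a vector fills the new slots with NA
    v
  })
  as.data.frame(padded)
}

pad_bind(x = 1:3, y = 1:5)
```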
EDIT 2: extending the loop to take in dataframes of any form as well
Now it becomes quite messy. You will also need to rename your output elements so they don't match any column names already present.
add.multicol <- function(df, ...){
#convert this number of arguments to a list
to.add.cols <- list(...)
#find number of columns to add
number.args <- length(to.add.cols)
#number of elements per list entry
hierarch.cols.to.add <- numeric(number.args)
for(i in 1:number.args){
#if this list element has only one name, treat it as an array, else treat it as a data frame
if(is.null(names(to.add.cols[[i]]))){
#get variable names from input of normal arrays
names(to.add.cols[[i]]) <- as.list(substitute(list(...)))[i+1]
hierarch.cols.to.add[i] <- 1
} else {
#find the number of columns in the data frame
number <- length(names(to.add.cols[[i]]))
hierarch.cols.to.add[i] <- number
}
}
#loop add.col
newframe <- df
for(i in 1:number.args){
#if array
if(hierarch.cols.to.add[i]==1){
to.add.col <- array(unlist(to.add.cols[[i]]))
to.add.col.name <- names(to.add.cols[[i]][1])
newframe <- add.col(newframe,to.add.col,to.add.col.name)
} else { #if data.frame
#foreach column in the data frame
for(j in 1:hierarch.cols.to.add[i]){
#if only one element per column
if(is.null(dim(to.add.cols[[i]]))){
to.add.col <- to.add.cols[[i]][j]
} else { #if multiple elements per column
to.add.col <- to.add.cols[[i]][,j]
}
to.add.col.name <- names(to.add.cols[[i]])[j]
newframe <- add.col(newframe,to.add.col,to.add.col.name)
}
}
}
return(newframe)
}
testdf <- data.frame(cbind(test1,test2))
dnametest <- 20:40
output <- apply(test[,2:length(test[1,])],2, function(v) freq(test$count, y=v, span=0.5))
#edit output names because we can't have a dataframe with the same name for multiple columns
names(output) <- c("output_a","output_b","output_c")
newframe <- test
#function now takes dataframes of single elements, normal data frames and single arrays
newframe <- add.multicol(newframe,output,dnametest,testdf)
# count a b c output_a output_b output_c dnametest test1 test2
#1 1 -0.802776 -2.2351200 0.7490620 2.5 3.571429 2.777778 20 0 25
#2 2 -0.748272 -1.3657200 0.0690315 NA NA NA 21 1 26
#3 3 0.187434 -0.0357366 -0.7504940 NA NA NA 22 2 27
#4 4 1.235770 0.9255630 -1.0406900 NA NA NA 23 3 28
#...
This is probably not the most efficient way, but it gets the job done.

Is there a way to stop table from sorting in R

Problem setup: Creating a function to take multiple CSV files selected by ID column and combine into 1 csv, then create an output of number of observations by ID.
Expected:
complete("specdata", 30:25) ##notice descending order of IDs requested
## id nobs
## 1 30 932
## 2 29 711
## 3 28 475
## 4 27 338
## 5 26 586
## 6 25 463
I get:
> complete("specdata", 30:25)
id nobs
1 25 463
2 26 586
3 27 338
4 28 475
5 29 711
6 30 932
Which is "wrong" because it has been sorted by id.
The CSV file I read from does have the data in descending order. My snippet:
dfTable<-read.csv("~/progAssign1/specdata/tmpdata.csv")
ccTab<-complete.cases(dfTable)
xTab3<-as.data.frame(table(dfTable$ID[ccTab]),)
colnames(xTab3)<-c("id","nobs")
And as near as I can tell, the third line is where sorting occurs. I broke out the expression and it happens in the table() call. I've not found any option or parameter I can pass to make something like sort=FALSE. You'd think...
Anyway. Any help appreciated!
So, the problem is in the output of table, which is sorted by default. For example:
> r = sample(5,15,replace = T)
> r
[1] 1 4 1 1 3 5 3 2 1 4 2 4 2 4 4
> table(r)
r
1 2 3 4 5
4 3 2 5 1
If you want to take the order of first appearance, you are going to get your hands a little bit dirty by recoding the table function:
unique_r = unique(r)
table_r = rbind(label=unique_r, count=sapply(unique_r,function(x)sum(r==x)))
table_r
[,1] [,2] [,3] [,4] [,5]
label 1 4 3 5 2
count 4 5 2 1 3
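Another way to get order of first appearance, sketched here with the same r values as above, is to turn the data into a factor whose levels are given by unique(); table() then follows the level order instead of sorting. The column name nobs is borrowed from the question.

```r
r <- c(1, 4, 1, 1, 3, 5, 3, 2, 1, 4, 2, 4, 2, 4, 4)

# Levels in order of first appearance, so table() keeps that order.
counts <- table(factor(r, levels = unique(r)))

as.data.frame(counts, responseName = "nobs")
```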
One way to get around this is...don't use table. Here's an example where I create three one-line data sets from your data. Then I read them in with a descending sequence, with read.table and it seems to be okay.
The real big thing here is that multiple data sets should be placed in a list upon being read into R. You'll get the exact order of data sets you want that way, among other benefits.
Once you've read them into R the way you want them, it's much easier to order them at the very end. Ordering of rows (for me) is usually the very last step.
> dat <- read.table(h=T, text = "id nobs
1 25 463
2 26 586
3 27 338
4 28 475
5 29 711
6 30 932")
Write three one-line files:
> write.table(dat[3,], "dat3.csv", row.names = FALSE)
> write.table(dat[2,], "dat2.csv", row.names = FALSE)
> write.table(dat[1,], "dat1.csv", row.names = FALSE)
Read them in using a 3:1 order:
> do.call(rbind, lapply(3:1, function(x){
read.table(paste0("dat", x, ".csv"), header = TRUE)
}))
# id nobs
# 1 27 338
# 2 26 586
# 3 25 463
Then, if we change 3:1 to 1:3 the rows "comply" with our request
> do.call(rbind, lapply(1:3, function(x){
read.table(paste0("dat", x, ".csv"), header = TRUE)
}))
# id nobs
# 1 25 463
# 2 26 586
# 3 27 338
And just for fun
> fun <- function(z){
do.call(rbind, lapply(z, function(x){
read.table(paste0("dat", x, ".csv"), header = TRUE) }))
}
> fun(c(2, 3, 1))
# id nobs
# 1 26 586
# 2 27 338
# 3 25 463
You may try something like this:
t1 <- c(5,3,1,3,5,5,5)
as.data.frame(table(t1)) ##result in ascending order
# t1 Freq
#1 1 1
#2 3 2
#3 5 4
t1 <- factor(t1)
as.data.frame(table(reorder(t1, rep(-1, length(t1)),sum)))
# Var1 Freq
#1 5 4
#2 3 2
#3 1 1
In your case you are complaining that the table function with a single argument returns the items with the names in ascending order, and you want them in descending order. You could have simply used the rev() function around the table call.
xTab3<-as.data.frame( rev( table( dfTable$ID[ccTab] ) ),)
(I'm not sure what that last comma is doing in there.) The sort order in the original data would not be expected to determine the order of a table operation. Generally R will return results with discrete labels sorted in alphabetical (ascending) order unless the levels of a factor have been specified differently. That's one of those R-specific rules that may be difficult to intuit. The other R-specific rule that may be difficult to grasp (although not really a problem here) is that arguments are often expected to be in the form of R lists.
It's probably wise to think about R table objects at this point (and what happens with the as.data.frame call). Table objects are actually R arrays (a one-way table is a one-dimensional array, a two-way table is a matrix), so the feature that you wanted to sort by was actually the rownames of that table object, which are of class character:
r = sample(5,15,replace = T)
table(r)
#r
#2 3 4 5
#5 3 2 5
rownames(table(r))
#[1] "2" "3" "4" "5"
str(as.data.frame(table(r)))
#-------
'data.frame': 4 obs. of 2 variables:
$ r : Factor w/ 4 levels "2","3","4","5": 1 2 3 4
$ Freq: int 5 3 2 5
I just wanna share this homework I've done
complete <- function(directory, id=1:332){
setwd("E:/Coursera")
files <- dir(directory, full.names = TRUE)
data <- lapply(files, read.csv)
specdata <- do.call(rbind, data)
cleandata <- specdata[!is.na(specdata$sulfate) & !is.na(specdata$nitrate),]
targetdata <- data.frame(Date=numeric(0), sulfate=numeric(0), nitrate=numeric(0), ID=numeric(0))
result<-data.frame(id=numeric(0), nobs=numeric(0))
for(i in id){
targetdata <- cleandata[cleandata$ID == i, ]
result <- rbind(result, data.frame(table(targetdata$ID)))
}
names(result) <- c("id","nobs")
result
}
A simple solution that no one has proposed yet is combining table() with the unique() function. unique() gives the behaviour that you are looking for (listing unique IDs in order of appearance).
In your case it would be something like this:
dfTable<-read.csv("~/progAssign1/specdata/tmpdata.csv")
ccTab<-complete.cases(dfTable)
x<-dfTable$ID[ccTab] #IDs of the complete cases
xTab3<-as.data.frame(table(x)[as.character(unique(x))]) #index by name (not by position) to put the table() result in order of appearance
colnames(xTab3)<-c("id","nobs")
