Counting unique values across variables (columns) in R

I have a large dataset with repeated measures over 5 time periods.
2012 2009 2006 2003 2000
   3    1    4    4    1
   5    3    2    2    3
   6    7    3    5    6
I want to add a new column, which is the number of unique values among years 2000 to 2012. e.g.,
2012 2009 2006 2003 2000 nunique
   3    1    4    4    1       3
   5    3    2    2    3       3
   6    7    3    5    6       4
I am working in R and, if it helps, there are only 14 possible different values of the measured value at each time period.
I found this page: Count occurrences of value in a set of variables in R (per row) and tried the various solutions offered on it. What those give me, however, is a count of each value, not the number of unique values.
Other similar questions on here seem to ask about counting the number of unique values within a variable/column, rather than across each row.
Any suggestions would be appreciated.

Here's one alternative
> df$nunique <- apply(df, 1, function(x) length(unique(x)))
> df
  2012 2009 2006 2003 2000 nunique
1    3    1    4    4    1       3
2    5    3    2    2    3       3
3    6    7    3    5    6       4
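If you prefer a tidyverse idiom, here is a rough equivalent (a sketch assuming dplyr >= 1.0, where rowwise() works with c_across(); the year columns are renamed y2012 etc. so the names are syntactic):
library(dplyr)

df <- data.frame(y2012 = c(3, 5, 6), y2009 = c(1, 3, 7), y2006 = c(4, 2, 3),
                 y2003 = c(4, 2, 5), y2000 = c(1, 3, 6))
df %>%
  rowwise() %>%                                        # treat each row as a group
  mutate(nunique = n_distinct(c_across(y2012:y2000))) %>%
  ungroup()
Note that rowwise() is convenient but not fast; for very wide or very long data, the matrix-based answers below will scale better.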

If you have a large dataset, you may want to avoid looping over the rows and instead use a faster framework, such as S4Vectors:
df <- data.frame('2012' = c(3,5,6),
                 '2009' = c(1,3,7),
                 '2006' = c(4,2,3),
                 '2003' = c(4,2,5),
                 '2000' = c(1,3,6),
                 check.names = FALSE)  # keep the year column names as-is
# duplicatedIntegerPairs() flags (value, row) pairs it has already seen,
# so the non-duplicated entries in each row are exactly its unique values
dup <- S4Vectors:::duplicatedIntegerPairs(as.integer(as.matrix(df)), row(df))
dim(dup) <- dim(df)
rowSums(!dup)
#[1] 3 3 4
Or, the matrixStats package:
m <- as.matrix(df)
mode(m) <- "integer"                       # rowTabulates() requires an integer matrix
rowSums(matrixStats::rowTabulates(m) > 0)  # count the values that occur at all
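Whichever route you take, it's worth checking the fast version against the plain apply() version on a small example; a minimal sanity check, assuming the df and m defined above:
# both approaches should agree on the toy data
stopifnot(rowSums(matrixStats::rowTabulates(m) > 0) ==
          apply(df, 1, function(x) length(unique(x))))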

The trick is to use apply over rows (MARGIN = 1), so that each row is passed to your function as x. You can then write a custom function, in this case one that uses unique and length, to get the answer that you want.
df <- data.frame('2012' = c(3,5,6), '2009' = c(1,3,7), '2006' = c(4,2,3),
                 '2003' = c(4,2,5), '2000' = c(1,3,6))
df$nunique <- apply(df, 1, function(x) length(unique(x)))

try this one out:
sapply(data, function(x) length(unique(x)))
(Note: sapply iterates over the columns of a data frame, so this counts unique values per column; for the per-row count asked about here, use apply(data, 1, function(x) length(unique(x))) as in the answers above.)

Related

Getting stale values on using ifelse in a dataframe

Hi, I am combining values from two columns into a final third column, based on priority: if the value in column 1 is missing or NA, then I take column 2.
df=data.frame(internal=c(1,5,"",6,"NA"),external=c("",6,8,9,10))
df
  internal external
1        1
2        5        6
3                 8
4        6        9
5       NA       10
df$final <- df$internal
df$final <- ifelse((df$final=="" | df$final=="NA"),df$external,df$final)
df
  internal external final
1        1              2
2        5        6     3
3                 8     4
4        6        9     4
5       NA       10     2
Why do I get final values of 4 and 2 for rows 3 and 5, when the external values there are 8 and 10? I don't know what's wrong, but these values don't make any sense to me.
The issue arises because R converts your values to factors.
Your code will work fine with
df <- data.frame(internal = c(1,5,"",6,"NA"),
                 external = c("",6,8,9,10),
                 stringsAsFactors = FALSE)
PS: this hideous conversion to factors definitely belongs in the R Inferno, http://www.burns-stat.com/pages/Tutor/R_inferno.pdf
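For what it's worth, a cleaner sketch is to store missing values as real NA from the start, which sidesteps the string and factor comparisons entirely (and note that since R 4.0.0, data.frame() defaults to stringsAsFactors = FALSE anyway):
# encode missing entries as NA rather than "" or the string "NA"
df <- data.frame(internal = c(1, 5, NA, 6, NA),
                 external = c(NA, 6, 8, 9, 10))
df$final <- ifelse(is.na(df$internal), df$external, df$internal)
df
#   internal external final
# 1        1       NA     1
# 2        5        6     5
# 3       NA        8     8
# 4        6        9     6
# 5       NA       10    10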

Split and Diff function in R

I have a data frame called data. I am splitting the data using split function by an attribute called KEY.
data <- split(data, data$KEY);
After splitting the data frame by KEY, what we get is data for individual firms; the data frame data held the data for all the firms in the universe. After the split, each piece has two columns, year and sales. For each piece, I have to calculate the incremental sales for each year. For instance, if we have 2002 - 10, 2003 - 12, 2004 - 15, 2005 - 20, what I am interested in getting would be 2003 - 2, 2004 - 3, 2005 - 5, for each split.
I have written a function, called mod_sale, to perform the job mentioned:
mod_sale <- function(data) {
  data <- data[order(data$year), ]   # sort by year
  sale_data <- diff(data$SALE)       # year-over-year increments
  data <- data[-1, ]                 # diff() is one element shorter
  data$SALE <- sale_data
  return(data)
}
Currently, I am using a for loop:
mod_data <- data.frame()
for (key in names(data)) {
  a <- try(mod_sale(data[[key]]))
  if (inherits(a, "try-error")) next
  mod_data <- rbind(mod_data, a)
}
I think there is some way I can use sapply (and maybe plyr too). Can someone help me improve this R code? I'm not sure how the sapply version would go.
sapply(data, mod_sale)
Any help would be appreciated. Thanks.
Edit:
Here is a data example:
key <- c(1,1,1,1,2,2,2,2,2,3,3,3)
sales <- c(12,12,15,8,3,6,3,9,9,12,3,7)
year <- c(2002,2003,2004,2005,2001,2002,2003,2004,2005,2003,2004,2005)
ovar <- runif(12, 5.0, 7.5)
a <- data.frame(key, sales, year, ovar)
In the resultant data.frame, I am expecting incremental sales rather than raw sales. Obviously, we will lose 3 data points, one for the starting year of each of the 3 keys, since we are taking differences. So there will be three fewer rows in the resultant data.frame, which would have columns key, diff(sales), year, and ovar.
This is what I would have done:
a$diffsales <- ave(a$sales, a$key, FUN = function(x) c(NA, diff(x)))
a
   key sales year     ovar diffsales
1    1    12 2002 6.845177        NA
2    1    12 2003 6.328153         0
3    1    15 2004 6.872669         3
4    1     8 2005 6.098920        -7
5    2     3 2001 7.154824        NA
6    2     6 2002 6.110810         3
7    2     3 2003 5.906624        -3
8    2     9 2004 5.214369         6
9    2     9 2005 5.818218         0
10   3    12 2003 5.354354        NA
11   3     3 2004 6.728992        -9
12   3     7 2005 7.412213         4
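If, as the question asks, the starting year of each key should be dropped rather than kept as NA, one extra line on the result above does it:
a <- a[!is.na(a$diffsales), ]  # drop each key's first year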
I appreciate the attempt to display what you'd tried. Thank you.
In the future, try to provide a small example, like this:
df <- data.frame(year = 2001:2010,
                 sale = sample(20, 10))
df <- rbind(df, df, df)
df$key <- rep(letters[1:3], each = 10)
That makes it much clearer what your data look like, and it makes it very easy for people trying to answer. The easier you make it for us, the faster and better the answers you'll get.
I'd recommend sorting before splitting:
# sort first (already sorted here, but you get the idea)
df <- df[order(df$key, df$year), ]
df_split <- split(df, df$key)
You don't actually want to use sapply. (Try it and see.) You just want lapply:
out <- lapply(df_split, function(x) {
  x$sale_diff <- c(NA, diff(x$sale))
  x[-1, ]                             # drop each group's first year
})
You'd put it all together again using:
do.call(rbind, out)
You're right, plyr or data.table could also do this. I'll leave those examples to others.
Using data.table:
library(data.table)
dt = data.table(a)
dt[, sale_diff := c(NA, diff(sales)), by = key]
dt
#    key sales year     ovar sale_diff
# 1:   1    12 2002 7.416857        NA
# 2:   1    12 2003 5.625818         0
# 3:   1    15 2004 5.018934         3
# 4:   1     8 2005 6.671986        -7
# 5:   2     3 2001 6.242739        NA
# 6:   2     6 2002 6.297763         3
# 7:   2     3 2003 6.482124        -3
# 8:   2     9 2004 6.724256         6
# 9:   2     9 2005 5.071265         0
#10:   3    12 2003 6.136681        NA
#11:   3     3 2004 6.974392        -9
#12:   3     7 2005 6.517553         4
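For completeness, the same per-group difference can also be sketched in dplyr (an alternative not shown in the answers above; it assumes dplyr is installed):
library(dplyr)
a %>%
  group_by(key) %>%                     # replaces the explicit split()
  arrange(year, .by_group = TRUE) %>%   # sort years within each key
  mutate(sale_diff = sales - lag(sales)) %>%
  ungroup()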

Recursive replacement in R

I am trying to clean some data and would like to replace zeros with the value from the previous date. I was hoping the following code would work, but it doesn't:
temp = c(1,2,4,5,0,0,6,7)
temp[which(temp==0)]=temp[which(temp==0)-1]
returns
1 2 4 5 5 0 6 7
instead of
1 2 4 5 5 5 6 7
which is what I was hoping for.
Is there a nice way of doing this without looping?
The operation is called "Last Observation Carried Forward" and is usually used to fill gaps in data. It's a common operation on time series and is therefore implemented in package zoo:
temp = c(1,2,4,5,0,0,6,7)
temp[temp==0] <- NA
library(zoo)
na.locf(temp)
#[1] 1 2 4 5 5 5 6 7
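One caveat worth knowing (this follows from zoo's documented default na.rm = TRUE): a zero in the first position has no previous value to carry forward, and na.locf() silently drops leading NAs unless told otherwise:
temp2 <- c(0, 2, 4)
temp2[temp2 == 0] <- NA
na.locf(temp2)                 # [1] 2 4    -- shorter than the input
na.locf(temp2, na.rm = FALSE)  # [1] NA 2 4 -- keeps the original length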
You could use essentially your same logic, except you'll want to apply it to the values vector that results from using rle:
temp = c(1,2,4,5,0,0,6,0)
o <- rle(temp)
o$values[o$values == 0] <- o$values[which(o$values == 0) - 1]
inverse.rle(o)
#[1] 1 2 4 5 5 5 6 6
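The rle trick shares the same edge case: a leading run of zeros has no run before it to copy from. Another sketch, using tidyr's fill() (assuming tidyr >= 1.0 and dplyr for na_if()), makes the carry-forward explicit and simply leaves a leading gap as NA:
library(dplyr)  # for na_if()
library(tidyr)  # for fill()

d <- data.frame(temp = c(1, 2, 4, 5, 0, 0, 6, 7))
d %>%
  mutate(temp = na_if(temp, 0)) %>%  # turn the zeros into real NAs
  fill(temp)                         # carry the last non-NA value forward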

Producing a rolling average of ALL the previous observations per ID in an unbalanced panel data set

I am trying to compute rolling means of an unbalanced data set. To illustrate my point I have produced this toy example of my data:
ID year Var RollingAvg(Var)
 1 2000   2              NA
 1 2001   3               2
 1 2002   4             2.5
 1 2003   2               3
 2 2001   2              NA
 2 2002   5               2
 2 2003   4             3.5
The column RollingAvg(Var) is what I want but can't get. In words, I am looking for the rolling average of ALL the previous observations of Var for each ID. I have tried using rollapply and ddply from the zoo and plyr packages, but I can't see how to set the rolling window length to use ALL the previous observations for each ID. Maybe I should use the plm package instead? Any help is appreciated.
I have seen other posts on rolling means for BALANCED panel data sets, but I can't seem to extrapolate those answers to unbalanced data.
Thanks,
M
Using data.table:
library(data.table)
d = data.table(your_df)
d[, RollingAvg := {avg = cumsum(Var)/seq_len(.N);
                   c(NA, avg[-length(avg)])},
  by = ID]
(or, even simpler)
d[, RollingAvg := c(NA, head(cumsum(Var)/(seq_len(.N)), -1)), by = ID]
Assuming that years are contiguous within each ID (which is the case in the example data) and DF is the input data frame, here is a solution using just base R. cumRoll is a function that performs the required operation on one ID, and ave then applies it by ID:
cumRoll <- function(x) c(NA, head(cumsum(x) / seq_along(x), -1))
DF$Roll <- ave(DF$Var, DF$ID, FUN = cumRoll)
The result is:
> DF
  ID year Var Roll
1  1 2000   2   NA
2  1 2001   3  2.0
3  1 2002   4  2.5
4  1 2003   2  3.0
5  2 2001   2   NA
6  2 2002   5  2.0
7  2 2003   4  3.5
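The same shift-the-cumulative-mean idea translates directly to dplyr; a minimal sketch, assuming dplyr is installed, that the DF above is the input, and that rows are already ordered by year within each ID:
library(dplyr)
DF %>%
  group_by(ID) %>%
  mutate(Roll = lag(cummean(Var))) %>%  # cumulative mean, shifted to exclude the current year
  ungroup()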

Reordering (deleting/changing order) columns of data in data frame

I have two large data sets and I am attempting to reformat the older data set to put the questions in the same order as the newer data set (so that I can easily perform t-tests on each identical question to track significant changes over the 2 years between data sets). The new version both deleted and added questions when changing from the old version.
The way I've been attempting to do this, R keeps crashing due to, as best I can figure, vectors being too large. I'm not sure how they are getting to be this large, however! Below is what I am doing:
Both data sets have the same format. The original sets have 415 columns for the new and 418 for the old. I want to match the first approximately 158 columns of the new data set to the old. Each data set has column names q1-q415, and the data in each column is numerical 1-5 or NA. There are approximately 100 answers per question/column; the old data set has more respondents (140 rows in old vs 114 rows in new). An example is below (but keep in mind there are over 400 columns in the full set and over 100 rows!)
The following is an example of what data.old looks like. data.new looks the same, just with a different number of rows of numeric/NA answers. Here I show questions 1 through 20 and the first 10 rows.
data.old = 418 columns (q1 through q418) x 140 rows
data.new = 415 columns (q1 through q415) x 114 rows
I need to match the first 170 COLUMNS of data.old to the first 157 COLUMNS of data.new
To do this, I will be deleting 17 columns from data.old (questions that were in the data.old questionnaire but deleted from the data.new questionnaire), and also adding 7 new columns to data.old (which will contain NAs... place holders for where data.new introduced new questions that did not exist in the data.old questionnaire).
> data.old
 q1 q2 q3 q4 q5 q6 q7 q8 q9 q10 q11 q12 q13 q14 q15 q16 q17 q18 q19 q20
  1  3  4  3  3  5  4  1 NA   4  NA   1   2  NA   5   4   3   2   3   1
  3  4  5  2  2  4 NA  1  3   2   5   2  NA   3   2   1   4   3   2  NA
  2 NA  2  3  2  1  4  3  5   1   2   3   4   3  NA  NA   2   1   2   5
  1  2  4  1  2  5  2  3  2   1   3  NA  NA   2   1   5   5  NA   2   3
  4  3 NA  2  1 NA  3  4  2   2   1   4   5   5  NA   3   2   3   4   1
  5  2  1  5  3  2  3  3 NA   2   1   5   4   3   4   5   3  NA   2  NA
 NA  2  4  1  5  5 NA NA  2  NA   1   3   3   3   4   4   5   5   3   1
  4  5  4  5  5  4  3  4  3   2   5  NA   2  NA   2   3   5   4   5   4
  2  2  3  4  1  5  5  3 NA   2   1   3   5   4  NA   2   3   4   3   2
  2  1  5  3 NA  2  3 NA  4   5   5   3   2  NA   2   3   1   3   2   4
So in the new set, some of the questions were deleted, some new ones were added, and some changed order, so I went through and created subsets of the old data in the order I would need to recombine them to match the new dataset. When a question does not exist in the old data set, I want to use the question from the new data set so that I can (theoretically) perform my t-tests in one big loop.
dataold.set1 <- dataold[1:16]
dataold.set2 <- dataold[18:19]
dataold.set3 <- dataold[21:23]
dataold.set4 <- dataold[25:26]
dataold.set5 <- dataold[30:33]
dataold.set6 <- dataold[35:36]
dataold.set7 <- dataold[38:39]
dataold.set8 <- dataold[41:42]
dataold.set9 <- dataold[44]
dataold.set10 <- dataold[46:47]
dataold.set11 <- dataold[49:54]
dataold.set12 <- datanew[43:49]
dataold.set13 <- dataold[62:85]
dataold.set14 <- dataold[87:90]
dataold.set15 <- datanew[78]
dataold.set16 <- dataold[91:142]
dataold.set17 <- dataold[149:161]
dataold.set18 <- dataold[55:61]
dataold.set19 <- dataold[163:170]
I then attempted to put the columns back together into one set.
I tried both
dataold.adjust <- merge(dataold.set1, dataold.set2)
dataold.adjust <- merge(dataold.adjust, dataold.set3)
dataold.adjust <- merge(dataold.adjust, dataold.set4)
and I also tried
dataold.adjust <- cbind(dataold.set1, dataold.set2, dataold.set3)
However, every time I try to perform these functions, R freezes and then crashes. I managed to get it to display an error once: it said it could not work with a vector of 10 Mb, and then I got multiple errors involving vectors of over 1000 Mb. I'm not really sure how my vectors are getting that large, when this crashes by set 3, which is only 23 columns of data, and the data sets I normally use are over 400 columns long.
Is there another way to do this that won't cause my program to crash with memory issues (and won't require me to type out the names of over 100 columns), or is there some element of the code here that I am missing that creates a memory sink? I've been attempting to troubleshoot it and have spent an hour dealing with R crashing without any luck figuring out how to make this work.
Thanks for the assistance!
You're making tons of unnecessary copies of your data, and because your column subsets share no common column names, merge() falls back to a Cartesian product of the rows, which is why the objects balloon. On top of that, you're growing the final object (dataold.adjust) step by step. You just need a vector that orders the columns correctly:
cols1 <- c(1:16,18:19,21:23,25:26,30:33,35:36,38:39,41:42,44,46:47,49:54)
cols2 <- c(62:85,87:90)
cols3 <- c(91:142,149:161,55:61,163:170)
# merge old / new data by row names and fill NA for unmatched rows
dataold.adjust <- merge(data.old[, c(cols1, cols2, cols3)],
                        data.new[, c(43:49, 78)], by = "row.names", all = TRUE)
# put the columns in the desired order; after the merge the layout is
# Row.names, then cols1, cols2, cols3 from data.old, then the 8 new columns
n1 <- length(cols1); n2 <- length(cols2); n3 <- length(cols3)
dataold.adjust <- dataold.adjust[, c(1,                           # Row.names
                                     1 + seq_len(n1),             # cols1 from data.old
                                     1 + n1 + n2 + n3 + 1:7,      # questions 43:49 from data.new
                                     1 + n1 + seq_len(n2),        # cols2 from data.old
                                     1 + n1 + n2 + n3 + 8,        # question 78 from data.new
                                     1 + n1 + n2 + seq_len(n3))]  # cols3 from data.old
The index arithmetic in the last part is still a bit of a kludge, but I've hit my self-imposed time limit for SO today. :)
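If the NA place-holders described in the question are really all that's needed (rather than data.new's actual answers in those columns), a simpler sketch avoids merge() entirely; na_block() is a hypothetical helper, and cols1/cols2/cols3 are the index vectors defined above:
# build NA placeholder columns, named after the new questions they stand in for
na_block <- function(df, cols) {
  setNames(as.data.frame(matrix(NA, nrow = nrow(df), ncol = length(cols))),
           paste0("q", cols, ".new"))
}
dataold.adjust <- cbind(data.old[cols1],
                        na_block(data.old, 43:49),  # placeholders for new q43-q49
                        data.old[cols2],
                        na_block(data.old, 78),     # placeholder for new q78
                        data.old[cols3])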
