I have a simple data frame:
a=data.frame(first=c(1,2,3),second=c(3,4,5),third=c('x','y','z'))
I'm trying to return a data frame that contains the column means for just the first and second columns. I've been doing it like this:
apply(a[,c('first','second')],2,mean)
Which returns the appropriate output:
first second
2 4
However, I want to know if I can do it using the function by. I tried this:
by(a, c("first", "second"), mean)
Which resulted in:
Error in tapply(seq_len(3L), list(`c("first", "second")` = c("first", :
arguments must have same length
Then, I tried this:
by(a, c(T, T,F), mean)
Which also did not yield the correct answer:
c(T,T,F): FALSE
[1] NA
Any suggestions? Thanks!
You can use colMeans (column means) on a subset of the original data
> a <- data.frame(first = c(1,2,3), second = c(3,4,5), third = c('x','y','z'))
If you know the column numbers, but not the column names,
> colMeans(a[, 1:2])
## first second
## 2 4
Or, if you don't know the column numbers but know the column names,
> colMeans(a[, c("first", "second")])
## first second
## 2 4
Finally, if you know nothing about the columns and want the means for the numeric columns only,
> colMeans(a[, sapply(a, is.numeric)])
## first second
## 2 4
by() is not the right tool, because it is a wrapper for tapply(), which partitions your data frame into subsets that meet some criteria. If you had another column, say fourth, you could split your data frame using by() for that column and then operate on rows or columns using apply().
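To see what by() is actually for, here is a minimal sketch, assuming a hypothetical grouping column fourth added to a:
a$fourth <- c("g1", "g1", "g2")  # hypothetical grouping column
by(a[, c("first", "second")], a$fourth, colMeans)
## a$fourth: g1
##  first second
##    1.5    3.5
## ------------------------------
## a$fourth: g2
##  first second
##      3      5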
I have 2 large data frames that are supposed to be duplicates of each other. However, there are 3 extra rows in the second data frame. I need to find these 3 extra rows that are not present in the first data frame and remove them from the second data frame so that the two frames match. The 3 rows could be located anywhere in the data frame, not just appended at the end.
I don't know the most efficient way to go about this. I have tried using the %in% operator alongside ! to go through each column of the data to find the rows that differ, but this is taking too long, as there are over 100 columns.
Has anyone got a more efficient way to do such a task?
Thanks
I think the most efficient way would be just to use the first data.frame which does not have those extra rows.
But if you need to know where they are in the second one, and the rows within each data frame are unique, you can use duplicated():
which(!tail(duplicated(rbind(x, y)), -nrow(x)))
#[1] 4 5
or using interaction and %in%:
which(!interaction(y) %in% interaction(x))
#[1] 4 5
or using paste and %in%:
which(!do.call(paste, y) %in% do.call(paste, x))
#[1] 4 5
Data:
x <- data.frame(a=1:3)
y <- data.frame(a=1:5)
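If you then want to drop those extra rows from y, as the question asks, a minimal follow-up sketch using the paste variant and the x/y data above:
keep <- do.call(paste, y) %in% do.call(paste, x)
y_clean <- y[keep, , drop = FALSE]  # keep only the rows of y that also appear in x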
You can use anti_join methods in R, but they do not exist for pandas data frames, so you can reproduce one with merge:
import pandas as pd

def anti_join(x, y, on):
    """Return rows in x which are not present in y (dataframe)"""
    ans = pd.merge(left=x, right=y, how='left', indicator=True, on=on)
    ans = ans.loc[ans._merge == 'left_only', :].drop(columns='_merge')
    return ans
Use this first function if you want to check on only one column (or the columns you pass to on);
def anti_join_all_cols(x, y):
    """Return rows in x which are not present in y"""
    assert set(x.columns.values) == set(y.columns.values)
    return anti_join(x, y, x.columns.tolist())
use this second one to compare on all columns of the data frame.
The result contains only the rows of df2 that are NOT IN df. Be careful with the order of the parameters: if you swap df and df2, the result will not be the same.
You can call it like this:
df_difference = anti_join_all_cols(df2, df)
Source: https://gist.github.com/sainathadapa/eb3303975196d15c73bac5b92d8a210f
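For reference, dplyr's anti_join in R does this directly; a minimal sketch, assuming df and df2 are data frames with the same columns:
library(dplyr)
df_difference <- anti_join(df2, df)  # joins on all common columns by default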
I have a data frame with 60000 obs. of 4 variables in the following format:
I need to replace every distinct character string in the first column with an integer, so that identical strings get the same number. So "101-startups" is 1, "10i10-aps" is 2, "10x" is 3, every "10x-fund-lp" is 4, and so on. The same goes for the second column.
How do I achieve this?
If I'm understanding your question correctly, all you need to do is something like:
my_data$col1 <- as.integer(factor(my_data$col1, levels = unique(my_data$col1)))
my_data$col2 <- as.integer(factor(my_data$col2, levels = unique(my_data$col2)))
It's probably a good idea to read up on factors.
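For a quick illustration of why this works, using the strings from the question:
v <- c("101-startups", "10i10-aps", "10x", "10x-fund-lp", "10x-fund-lp")
as.integer(factor(v, levels = unique(v)))
## [1] 1 2 3 4 4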
Try building a separate dataframe from the unique entries of that column, then use the row names (which will be consecutive integers). If your dataframe is df and that first column is v1, something like
x = data.frame(v1 = unique(df$v1))
x$numbers = row.names(x)
Then you can do some kind of merge:
final.df = merge(x, df, by = "v1")
and then use something like dplyr to select/drop/rearrange columns.
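A minimal sketch of that last step, assuming dplyr and the names used above:
library(dplyr)
final.df <- final.df %>% select(numbers, everything())  # move the integer codes to the front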
I've been having problems with this one for a while.
What I would like is to apply a function to a data.frame that is divided by factors. This data frame has n>2 columns of values that I need to use for this function.
For the sake of this example, the dataset has a column of 5 factors (a,b,c,d,e) and 2 columns of values (values1, values2). I would like to apply a number of functions that take into account each column of values (auto.arima first and forecast.Arima on its output, in this case). A dataset to play with follows:
library(forecast)
set.seed(2)
dat <- data.frame(factors = letters[1:5], values1 = rnorm(50), values2 = rnorm(50))
I would like (for the sake of the exercise) to apply auto.arima to values1 and values2, per factor. My expected output would, per factor, take both columns of values into account and forecast each as its own univariate time series. So if the dataset has 5 factors and 2 columns of values, I would need 10 lists/data.frames.
Some options that did not work: Splitting the data.frame per factor via:
split(dat, dat$factors)
And then using rapply:
rapply(dat, function(x) forecast.Arima(auto.arima(x)), dat$factors)
Or lapply:
lapply(split(dat,dat$factors), function(x) forecast.Arima(auto.arima(x)))
And some other combinations, all to no avail.
I thought that the easiest solution would involve a function in the apply family, but any solution would be valid.
Is this what you're looking for?
library(reshape2)
m = melt(dat, id.vars = "factors")
l = split(m, paste(m$factors, m$variable))
lapply(l, function(x) forecast.Arima(auto.arima(x$value)))
i.e. splitting the data into 10 different frames, then applying the forecast on the values?
The problem with your apply solutions is that you were passing the whole data frame to the auto.arima function, which takes a vector, so you'd need something like this:
lapply(split(dat,dat$factors), function(df) {
apply(df[,-1], 2, function(col) forecast.Arima(auto.arima(col)))
})
This splits the data frame on the factors as before and then applies the auto.arima function, wrapped in forecast.Arima, over each column (ignoring the first, which holds the factors). This returns a list of lists (5 factors by 2 values), which allows you to keep values1 and values2 separate.
You can use unlist(x, recursive=FALSE) to flatten this to a list of 10.
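For instance, assuming res holds the list of lists returned by the lapply above:
res <- lapply(split(dat, dat$factors), function(df) {
  apply(df[, -1], 2, function(col) forecast.Arima(auto.arima(col)))
})
flat <- unlist(res, recursive = FALSE)
length(flat)  # 10, with names like "a.values1", "a.values2", ...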
I split a dataframe to create a dataframe list. The list has 401 dataframes, each identical in structure (same columns) but with a potentially different number of rows.
When I split the dataframe, I introduced 0 variance columns (colSums=0). Dataframes in the list may share 0 variance columns, or they may have totally different columns with 0 variance.
I have used the following function (from Quickly remove zero variance variables from a data.frame) to remove 0 variance columns from each dataset:
zeroVar <- function(data, useNA = 'ifany') {
  out <- apply(data, 2, function(x) length(table(x, useNA = useNA)))
  which(out == 1)
}
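As a quick check of what zeroVar returns, here is a small hypothetical frame (note the indices are relative to whatever data you pass in):
d <- data.frame(a = 1:3, b = rep(1, 3))
zeroVar(d, useNA = 'no')
## b
## 2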
When I pass my data frame list to the function (ignoring the first two character columns of dataframe_list):
dataframe_list_zero_var_rm<-lapply(dataframe_list, function(d) d[,-zeroVar(d[,3:ncol(d)], useNA = 'no')])
No errors/flags are thrown.
However, while dataframes in dataframe_list_zero_var_rm have fewer columns than they do in dataframe_list, they still have columns that have zero variance, as revealed by:
zeroVar(dataframe_list_zero_var_rm[[1]][, 3:ncol(dataframe_list_zero_var_rm[[1]])], useNA = 'no')
Passing the new dataframe to the original function shows me three columns with 0 variance which should have been removed in the first place.
This is a problem for me because I am trying to do principal components analysis on every dataframe in the list, but the zero variance columns become problematic for prcomp().
My ideal solution would be a way to
loop through each element of the dataframe list and remove columns from each dataframe that have zero variance
then, loop through each element of the dataframe list and perform prcomp() on the dataframe
You can use this approach from data.table:
library(data.table)
lapply(df_list, setDT)  # convert all of your data.frames to data.tables
all_pos_var <-
  lapply(df_list, function(dt) {
    dt[, unlist(dt[, lapply(names(dt)[3:ncol(dt)],
                            function(x) {
                              if (diff(range(get(x))) != 0) x
                            })]),
       with = FALSE]
  })
The inner lapply gets the column names of all non-0-variance (equivalent to non-0-range) columns: lapply(names(dt), function(x) if (diff(range(get(x))) != 0) x).
The outer lapply applies this procedure to all of your data.frame/data.tables.
Test data:
set.seed(101)
dt1 <- data.frame(ig1 = rnorm(10), ig2 = rnorm(10),
                  zv1 = rep(1, 10), nzv2 = runif(10),
                  zv3 = rep(2, 10), nzv4 = runif(10))
dt2 <- data.frame(ig1 = rnorm(10), ig2 = rnorm(10),
                  zv1 = rep(3, 10), nzv2 = rnorm(10),
                  zv3 = rep(4, 10), nzv4 = rnorm(10),
                  zv5 = rep(5, 10), nzv6 = rnorm(10))
df_list <- list(dt1, dt2)
Only nzv* variables should be returned; indeed:
> lapply(all_pos_var,names)
[[1]]
[1] "nzv2" "nzv4"
[[2]]
[1] "nzv2" "nzv4" "nzv6"
To wrap your head around the double lapply:
First, try to understand what the inner lapply is doing by focusing on a single data.frame:
setDT(dt1)
rel_cols<-names(dt1)[3:ncol(dt1)]
The inner lapply is:
nzcols <- dt1[, lapply(rel_cols, function(x) if (diff(range(get(x))) != 0) x)]
> nzcols
V1 V2
1: nzv2 nzv4
The unlist part converts nzcols to a character vector, which can then be used to subset dt1 (note that we need to use the parameter with=F when passing quoted column names to a data.table):
> dt1[,unlist(nzcols),with=F]
nzv2 nzv4
1: 0.43496175 0.07921225
2: 0.44205468 0.43388945
3: 0.76068946 0.67977425
4: 0.33296130 0.73435624
5: 0.39435715 0.45251087
6: 0.23329428 0.78378572
7: 0.07160766 0.67983554
8: 0.91338349 0.51870365
9: 0.77169357 0.69080575
10: 0.10753664 0.58827565
The outer lapply simply applies this procedure to all of the data.tables in df_list.
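And for the asker's second step, a minimal sketch, assuming the all_pos_var list from above and that every remaining column is numeric:
pca_list <- lapply(all_pos_var, function(d) prcomp(as.data.frame(d), center = TRUE, scale. = TRUE))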
I have a dataset that looks something like this (but much larger)
Jul_08 <- c(1,0,2,0,3)
Aug_08 <- c(0,0,1,0,1)
Sep_08 <- c(0,1,0,0,1)
month<-c("Jul_08","Aug_08","Jul_08","Sep_08","Jul_08")
dataset <- data.frame(Jul_08 = Jul_08, Aug_08 = Aug_08, Sep_08=Sep_08,month=month)
For each row, I would like to isolate the value for a single month only, as indicated by the "month" field. In other words, for a given row, if the column month equals "Jul_08", then the new "value" column should contain the datum from the Jul_08 column of that row.
In essence, the output would add this value column to the dataset
value<-c(1,0,2,0,3)
Creating this final dataset
dataset.value<-cbind(dataset,value)
You can use matrix indexing:
w <- match(month, names(dataset))
dataset$value <- as.numeric(dataset[cbind(seq_len(nrow(dataset)), w)])
Here the w vector tells R which column to take the value from, and seq_len says "use the same row", so the value column is constructed by taking the 1st column in the 1st row, the 2nd column in the 2nd row, the 1st column in the 3rd row, and so on. (The as.numeric is needed because matrix-indexing a data frame that also contains a character column goes through as.matrix, which returns character values.)
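To see which (row, column) pairs get picked with the question's data:
w <- match(month, names(dataset))
w
## [1] 1 2 1 3 1
# so row 1 takes column 1 (Jul_08), row 2 takes column 2 (Aug_08),
# row 3 takes column 1, row 4 takes column 3 (Sep_08), row 5 takes column 1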
You can use lapply:
value <- unlist(lapply(1:nrow(dataset),
                       function(r) {
                         dataset[r, as.character(dataset[r, 'month'])]
                       }))
> value
[1] 1 0 2 0 3
Or, alternatively:
value <- diag(as.matrix(dataset[,as.character(dataset$month)]))
> value
[1] 1 0 2 0 3
Then you can cbind the new column as you did in your example.
Some notes:
I prefer unlist(lapply(...)) over sapply, since the automagic simplification implemented in the sapply function tends to surprise me sometimes. But I'm pretty sure that this time you could use it without any problem.
as.character is necessary only if the month column is a factor (as in the example); otherwise it is redundant (but I would leave it in, just to be safe).
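A tiny sketch of the pitfall that note guards against (a factor subscript uses the underlying integer codes, not the labels):
x <- c(a = 10, b = 20)
f <- factor("b", levels = c("b", "a"))
x[f]                # 10 -- indexed by the code (1), not by the label "b"
x[as.character(f)]  # 20 -- indexed by name, as intended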