Divide many columns by another column - r

Ok with
A <- c(1:10)
B <- c(2:11)
C <- c(3:12)
df1 <- data.frame(A,B,C)
the first two divisions below work, but I do not understand the error raised by the last one (c <- a/b):
df2 <- df1 / df1[,"C"]
df2 <- df1[1:3,] / df1[1:3,"C"]
a <- subset (df1, select = c(A, B))
b <- subset (df1, select = c (C))
c <- a/b
## Error in Ops.data.frame(a, b) :
## ‘/’ only defined for equally-sized data frames
seeing that both have the same number of rows:
dim(a)
dim(b)

R automatically drops dimensions (unless you explicitly specify drop=FALSE) when the use of matrix indexing results in a dimension of size 1 (i.e., one row or one column), but using subset() on a data frame always results in a data frame (even if it's only one column):
> str(b)
'data.frame': 10 obs. of 1 variable:
$ C: int 3 4 5 6 7 8 9 10 11 12
> str(df1[,"C"])
int [1:10] 3 4 5 6 7 8 9 10 11 12
So dividing by df1[,"C"] is dividing by a numeric (integer) vector rather than by a data frame. The error ‘/’ only defined for equally-sized data frames means that the two data frames should be exactly equally-sized (same number of rows and columns).
sweep(df1,df1[,"C"],MARGIN=1,"/") might be safer.

Because a and b are not equally sized (they have different numbers of columns), the division throws an error.
The following divides a by the column vector b$C instead:
c <- a/b$C
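To make the difference concrete, here is a minimal sketch using the question's own data. The names res_vec and res_sweep are only illustrative, and the sweep() call is written with named arguments as one possible spelling of the suggestion above.

```r
# Data from the question
df1 <- data.frame(A = 1:10, B = 2:11, C = 3:12)
a <- subset(df1, select = c(A, B))  # 10 x 2 data frame
b <- subset(df1, select = c(C))     # 10 x 1 data frame, not a vector

# Dividing by a vector works: it is recycled down each column of a
res_vec <- a / b$C

# sweep() states explicitly that the statistic is applied along rows
res_sweep <- sweep(a, MARGIN = 1, STATS = b$C, FUN = "/")

# drop = FALSE keeps a data frame; [[ ]] or $ extracts the plain vector
is.data.frame(df1[, "C", drop = FALSE])
is.vector(df1[["C"]])
```

Both results should hold the same values; a / b (data frame divided by a one-column data frame) is the only variant that fails.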

Related

How to combine columns that have the same name and remove NA's?

Relatively new to R, but I have an issue combining columns that have the same name. I have a very large dataframe (~70 cols and 30k rows). Some of the columns have the same name. I wish to merge these columns and remove the NA's.
An example of what I would like is below (although on a much larger scale).
df <- data.frame(x = c(2,1,3,5,NA,12,"blah"),
x = c(NA,NA,NA,NA,9,NA,NA),
y = c(NA,5,12,"hop",NA,2,NA),
y = c(2,NA,NA,NA,8,NA,4),
z = c(9,5,NA,3,2,6,NA))
desired.result <- data.frame(x = c(2,1,3,5,9,12,"blah"),
y = c(2,5,12,"hop",8,2,4),
z = c(9,5,NA,3,2,6,NA))
I have tried a number of things including suggestions such as:
R: merging columns and the values if they have the same column name
Combine column to remove NA's
However, these solutions either require a numeric dataset (I need to keep the character information) or they require you to manually input the columns that are the same (which is too time consuming for the size of my dataset).
I have managed to solve the issue manually by creating new columns that are combinations:
df$x <- apply(df[,1:2], 1, function(x) x[!is.na(x)][1])
However I don't know how to get R to auto-identify where the columns have the same names and then apply something like the above such that I don't need to specify the index each time.
Thanks
Here is a base R approach:
# split into a named list, based on the column names before the '.' character
L <- split.default(df, f = gsub("(.*)\\..*", "\\1", names(df)))
#get the first non-na value for each row in each chunk
L2 <- lapply(L, function(x) apply(x, 1, function(y) na.omit(y)[1]))
# result in a data.frame
as.data.frame(L2)
# x y z
# 1 2 2 9
# 2 1 5 5
# 3 3 12 NA
# 4 5 hop 3
# 5 9 8 2
# 6 12 2 6
# 7 blah 4 NA
# since you are using mixed formats, the columns are not all of the same class!
str(as.data.frame(L2))
# 'data.frame': 7 obs. of 3 variables:
# $ x: chr "2" "1" "3" "5" ...
# $ y: chr " 2" "5" "12" "hop" ...
# $ z: num 9 5 NA 3 2 6 NA
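Note the padded " 2" in column y above: apply() converts each chunk to a character matrix first, which formats the numeric column. As a hedged alternative (not the answerer's method), the sketch below coalesces the columns of each group pairwise, which avoids the padding. The helper name coalesce_cols is hypothetical, and stringsAsFactors = FALSE is set explicitly so the behaviour does not depend on the R version.

```r
df <- data.frame(x = c(2, 1, 3, 5, NA, 12, "blah"),
                 x = c(NA, NA, NA, NA, 9, NA, NA),
                 y = c(NA, 5, 12, "hop", NA, 2, NA),
                 y = c(2, NA, NA, NA, 8, NA, 4),
                 z = c(9, 5, NA, 3, 2, 6, NA),
                 stringsAsFactors = FALSE)
# data.frame() has already renamed the duplicates to x, x.1, y, y.1, z

# Group columns by their name with any trailing ".<digits>" removed
groups <- split.default(df, sub("\\.[0-9]+$", "", names(df)))

# Hypothetical helper: take the first non-NA value across the columns
# of one group; a single-column group is returned unchanged
coalesce_cols <- function(cols) {
  Reduce(function(a, b) ifelse(is.na(a), as.character(b), as.character(a)),
         cols)
}

result <- as.data.frame(lapply(groups, coalesce_cols),
                        stringsAsFactors = FALSE)
str(result)
```

Groups that were merged come back as character columns; z, which had no duplicate, keeps its numeric class.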

random sampling of columns based on column group

I have a simple problem which can be solved in a dirty way, but I'm looking for a clean way using data.table
I have the following data.table with n columns belonging to m unequal groups. Here is an example of my data.table:
dframe <- as.data.frame(matrix(rnorm(60), ncol=30))
cletters <- rep(c("A","B","C"), times=c(10,14,6))
colnames(dframe) <- cletters
A A A A A A
1 -0.7431185 -0.06356047 -0.2247782 -0.15423889 -0.03894069 0.1165187
2 -1.5891905 -0.44468389 -0.1186977 0.02270782 -0.64950716 -0.6844163
A A A A B B B
1 -1.277307 1.8164195 -0.3957006 -0.6489105 0.3498384 -0.463272 0.8458673
2 -1.644389 0.6360258 0.5612634 0.3559574 1.9658743 1.858222 -1.4502839
B B B B B B B
1 0.3167216 -0.2919079 0.5146733 0.6628149 0.5481958 -0.01721261 -0.5986918
2 -0.8104386 1.2335948 -0.6837159 0.4735597 -0.4686109 0.02647807 0.6389771
B B B B C C
1 -1.2980799 0.3834073 -0.04559749 0.8715914 1.1619585 -1.26236232
2 -0.3551722 -0.6587208 0.44822253 -0.1943887 -0.4958392 0.09581703
C C C C
1 -0.1387091 -0.4638417 -2.3897681 0.6853864
2 0.1680119 -0.5990310 0.9779425 1.0819789
What I want to do is to take a random subset of the columns (of a specific size), keeping the same number of columns per group (if the chosen sample size is larger than the number of columns belonging to one group, take all of the columns of this group).
I have tried an updated version of the method mentioned in this question:
sample rows of subgroups from dataframe with dplyr
but I'm not able to map the column names to the by argument.
Can someone help me with this?
Here's another approach, IIUC:
idx <- split(seq_along(dframe), names(dframe))
keep <- unlist(Map(sample, idx, pmin(7, lengths(idx))))
dframe[, keep]
Explanation:
The first step splits the column indices according to the column names:
idx
# $A
# [1] 1 2 3 4 5 6 7 8 9 10
#
# $B
# [1] 11 12 13 14 15 16 17 18 19 20 21 22 23 24
#
# $C
# [1] 25 26 27 28 29 30
In the next step we use
pmin(7, lengths(idx))
#[1] 7 7 6
to determine the sample size in each group and apply this to each list element (group) in idx using Map. We then unlist the result to get a single vector of column indices.
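One caveat worth sketching: sample(x, size) treats a single number x as 1:x, so a group with exactly one column would be sampled from the wrong index set. The variant below is an illustration under that assumption; sample_safe is a hypothetical helper, not part of the answer above.

```r
set.seed(42)
dframe <- as.data.frame(matrix(rnorm(60), ncol = 30))
colnames(dframe) <- rep(c("A", "B", "C"), times = c(10, 14, 6))

# Hypothetical helper: sample positions within x, so a length-1 group
# is never expanded to 1:x by sample()
sample_safe <- function(x, size) x[sample.int(length(x), size)]

idx  <- split(seq_along(dframe), names(dframe))
keep <- unlist(Map(sample_safe, idx, pmin(7, lengths(idx))),
               use.names = FALSE)

# sort() keeps the original column order within the subset
sub_df <- dframe[, sort(keep)]

# Number of columns kept per original group
table(names(dframe)[keep])
```

Because the column names are duplicated, the subset comes back with names made unique (A, A.1, ...), which is why the counts are taken from the original names.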
Not sure if you want a solution with dplyr, but here's one with just lapply:
dframe <- as.data.frame(matrix(rnorm(60), ncol=30))
cletters <- rep(c("A","B","C"), times=c(10,14,6))
colnames(dframe) <- cletters
# Number of columns to sample per group
nc <- 8
res <- do.call(cbind,
lapply(unique(colnames(dframe)),
function(x){
dframe[,if(sum(colnames(dframe) == x) <= nc) which(colnames(dframe) == x) else sample(which(colnames(dframe) == x),nc,replace = F)]
}
))
It might look complicated, but it simply keeps all columns of a group if there are at most nc of them, and samples nc random columns otherwise.
Subsetting makes the duplicated column names unique (A, A.1, A.2, ...); to restore your original column-name scheme, sub does the trick:
colnames(res) <- sub('\\.[[:digit:]]+$', '', colnames(res))

How to look up the value of a row/column combination in a matrix (R)?

I have the following look up problem in R (but I am not sure whether I use this term 100% correctly). Given is a matrix with data points, where row and column names are identical and in the same order (as in e.g., a covariance matrix). Also given is a data.frame of row and column name pairs for which the corresponding value should be looked up in the matrix.
To illustrate (and using a non-symmetric matrix for generality):
set.seed(1)
m = matrix(1:25,5,5)
colnames(m) <- c("A","B","C","D","E")
rownames(m) <- c("A","B","C","D","E")
l <- matrix(ncol=2,nrow=5)
for(i in 1:5){
l[i,] <- sample(c("A","B","C","D","E"),2,replace = FALSE) #choose TRUE if diagonal elements should be included in the list
}
l <- as.data.frame(l)
colnames(l) <- c("row","column")
So we have matrix m and data.frame l (that m and l have the same number of rows is coincidental; nrow(l) could be much higher, though redundant pairs surely occur for >25):
A B C D E
A 1 6 11 16 21
B 2 7 12 17 22
C 3 8 13 18 23
D 4 9 14 19 24
E 5 10 15 20 25
row column
1 B E
2 C D
3 B D
4 E C
5 D A
And we seek an algorithm that finds:
> c(22,18,17,15,4)
I'd be happy for pointers on what this problem is properly called, and for practical solutions.
You can use matrix subsetting on the row names as follows:
m[cbind(as.character(l$row), as.character(l$column))]
[1] 22 18 17 15 4
The help file help("[") says, regarding matrix subsetting:
When indexing arrays by [ a single argument i can be a matrix with as many columns as there are dimensions of x; the result is then a vector with elements corresponding to the sets of indices in each row of i.
Also, regarding character subsetting:
Character vectors will be matched to the names of the object (or for matrices/arrays, the dimnames).
These two features combine to achieve what you are looking for.
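As a self-contained check, the sketch below rebuilds m and hard-codes the five pairs shown in the question (instead of drawing them with sample(), whose output depends on the R version). The integer-position variant with match() is an assumption-free equivalent added only for comparison.

```r
m <- matrix(1:25, 5, 5,
            dimnames = list(LETTERS[1:5], LETTERS[1:5]))

# The (row, column) pairs printed in the question
l <- data.frame(row    = c("B", "C", "B", "E", "D"),
                column = c("E", "D", "D", "C", "A"),
                stringsAsFactors = FALSE)

# Two-column character matrix: each row names one cell of m
vals_by_name <- m[cbind(l$row, l$column)]

# Same lookup via integer positions
pos <- cbind(match(l$row, rownames(m)), match(l$column, colnames(m)))
vals_by_pos <- m[pos]

vals_by_name
```

If l$row and l$column are factors, wrap them in as.character() as in the answer, since cbind() on factors would otherwise produce integer codes.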

How to rename all columns in a dataframe to include the name of the data for all dataframes in a list?

I have a list of data frames listofdfs. To rename the columns of one of the dataframes singledf in the list, the following code works:
colnames(listofdfs[["singledf"]]) <- paste(colnames(listofdfs[["singledf"]]), "singledf")
Aim
To rename all columns for all data frames in a list of dataframes, listofdfs, to include the name of the dataframe in all of the respective column names.
Attempt 1
for (i in listofdfs){
colnames(listofdfs[i]) <- paste(colnames(listofdfs[i]), i)
}
This error occurs
Error in `*tmp*`[i] : invalid subscript type 'list'
Attempt 2
for (i in listofdfs){
newnames <- paste(colnames(listofdfs[i]), i)
colnames(bsl) <- newnames
}
Result
No error is printed, however when I check one of the dataframes' columns the column names remain unchanged.
Attempt 3
for (i in listofdfs){
colnames(listofdfs[[i]]) <- paste(colnames(listofdfs[[i]]), i)
}
This error occurs
Error in listofdfs[[i]] : invalid subscript type 'list'
The code below renames the columns of each data.frame in a list so that the name of the data.frame is prepended to the original column names.
# example data
a <- data.frame(col1 = 1:10, col2 = 10:1)
b <- data.frame(col_01 = 11:20, col_02 = 20:11)
# list of data.frames
list_of_df <- list(a, b)
# names of data.frames
names(list_of_df) <- c("a", "b")
# my sequence and names of data.frames in a list
my_seq <- seq_along(list_of_df)
my_list_names <- names(list_of_df)
# procedure
for (i in my_seq) {
names(list_of_df[[my_seq[i]]]) <-
paste(my_list_names[i], names(list_of_df[[my_seq[i]]]), sep = "_")
}
list_of_df
$a
a_col1 a_col2
1 1 10
2 2 9
3 3 8
4 4 7
5 5 6
6 6 5
7 7 4
8 8 3
9 9 2
10 10 1
$b
b_col_01 b_col_02
1 11 20
2 12 19
3 13 18
4 14 17
5 15 16
6 16 15
7 17 14
8 18 13
9 19 12
10 20 11
You are almost there. Do something like this:
for (i in seq_along(listofdfs)){
colnames(listofdfs[[i]]) <- paste(colnames(listofdfs[[i]]), i)
}
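This should run your column-naming logic without any error. Note that i is now a numeric index, so use names(listofdfs)[i] inside paste() if you want the name of the data frame rather than its position.
Why does the original loop fail?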
for (i in listofdfs){
colnames(listofdfs[[i]]) <- paste(colnames(listofdfs[[i]]), i)
}
Because you are expecting i to be an index, but it is actually a data.frame itself. Debug using print:
for (i in listofdfs){
print(class(i))
}
This is what you get:
[1] "data.frame"
[1] "data.frame"
You can't subscript using a data.frame. A for loop with the in operator in R iterates over the individual elements, not their indexes. Hence we have to use seq_along.
Hope this helps.
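For completeness, a loop-free sketch of the same idea: Map() walks the list and its names in parallel, so the name of each data frame is available without indexing. The example data and the name renamed are illustrative, and the "_" separator follows the first answer rather than the question's paste() default.

```r
list_of_df <- list(a = data.frame(col1 = 1:3, col2 = 3:1),
                   b = data.frame(col_01 = 4:6, col_02 = 6:4))

# For each data frame df and its list name nm, prefix the column names
renamed <- Map(function(df, nm) {
  setNames(df, paste(nm, names(df), sep = "_"))
}, list_of_df, names(list_of_df))

lapply(renamed, names)
```

This returns a new list and leaves list_of_df untouched; assign the result back if the renaming should replace the original.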
Consider a DataFrame df having 5 columns:
col1, col2, col3, col4, col5, which you need to rename to
name, age, DOB, city, country
You can use a simple approach to do this (Spark/Scala syntax, not R):
val renamedColumns=df.toDF("name","age","DOB","city","country")

remove factors with criteria

I'm dealing with the KDD 2010 data https://pslcdatashop.web.cmu.edu/KDDCup/downloads.jsp
In R, how can I remove rows with a factor that has a low total number of instances?
I've tried the following:
create a table for the student name factor
studenttable <- table(data$Anon.Student.Id)
returns a table
l5eh0S53tB Qwq8d0du28 tyU2s0MBzm dvG32rxRzQ i8f2gg51r5 XL0eQIoG72
9890 7989 7665 7242 6928 6651
then I can get a table that tells me if there are more than 1000 data points for a given factor level
biginstances <- studenttable>1000
then I tried making a subset of the data on this query
bigdata <- subset(data, (biginstances[Anon.Student.Id]))
But I get weird subsets that still have the original number of factor levels as the full set.
I'm simply interested in removing the rows that have a factor that isn't well represented in the dataset.
There are probably more efficient ways to do this but this should get you what you want. I didn't use the names you used but you should be able to follow the logic just fine (hopefully!)
# Create some fake data
dat <- data.frame(id = rep(letters[1:5], 1:5), y = rnorm(15))
# tabulate the id variable
tab <- table(dat$id)
# Get the names of the ids that we care about.
# In this case the ids that occur >= 3 times
idx <- names(tab)[tab >=3]
# Only look at the data that we care about
dat[dat$id %in% idx,]
@Dason gave you some good code to work with as a starting point. I'm going to try to explain why (I think) what you tried didn't work.
biginstances <- studenttable>1000
This will create a logical vector whose length is equal the number of unique student id's. studenttable contained a count for each unique value of data$Anon.Student.Id. When you try to use that logical vector in subset:
bigdata <- subset(data, (biginstances[Anon.Student.Id]))
its length is almost surely much less than the number of rows in data. And since the subsetting criterion in subset is meant to identify rows of data, R's recycling rules take over and you get 'weird' looking subsets.
I would also add that taking subsets to remove rare factor levels will not change the levels attribute of the factor. In other words, you'll get a factor back with no instances of that level, but all of the original factor levels will remain in the levels attribute. For example:
> fac <- factor(rep(letters[1:3],each = 3))
> fac
[1] a a a b b b c c c
Levels: a b c
> fac[-(1:3)]
[1] b b b c c c
Levels: a b c
> droplevels(fac[-(1:3)])
[1] b b b c c c
Levels: b c
So you'll want to use droplevels if you want to ensure that those levels are really 'gone'. Also, see options(stringsAsFactors = FALSE).
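Putting the two points together, a minimal sketch on made-up data (the threshold of 3 and all object names are assumptions for illustration): filter on the tabulated counts, then call droplevels() so the rare levels disappear from the factor as well.

```r
# Toy data: level "a" occurs once, "b" twice, ..., "e" five times
dat <- data.frame(id = factor(rep(letters[1:5], 1:5)), y = seq_len(15))

counts <- table(dat$id)
keep   <- names(counts)[counts >= 3]

# Row filter by level name, then drop the now-empty levels
big <- droplevels(dat[dat$id %in% keep, ])

# Equivalent row mask: index the table by name, one entry per row of dat
keep_rows <- as.vector(counts[as.character(dat$id)] >= 3)

levels(big$id)
```

Indexing the table with as.character(dat$id) looks counts up by level name, so the mask does not depend on the order of the table.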
Another approach will involve a join between your dataset and the table of interest.
I'll use plyr for my purpose but it can be done using base function (like merge and as.data.frame.table)
require(plyr)
set.seed(123)
Data <- data.frame(var1 = sample(LETTERS[1:5], size = 100, replace = TRUE),
var2 = 1:100)
R> table(Data$var1)
A B C D E
19 20 21 22 18
## count the rows in each category
mytable <- count(Data, vars = "var1")
## mytable <- as.data.frame(table(Data$var1))
R> str(mytable)
'data.frame': 5 obs. of 2 variables:
$ var1: Factor w/ 5 levels "A","B","C","D",..: 1 2 3 4 5
$ freq: int 19 20 21 22 18
Data <- join(Data, mytable)
## Data <- merge(Data, mytable)
R> str(Data)
'data.frame': 100 obs. of 3 variables:
$ var1: Factor w/ 5 levels "A","B","C","D",..: 3 2 3 5 3 5 5 4 3 1 ...
$ var2: int 1 2 3 4 5 6 7 8 9 10 ...
$ freq: int 21 20 21 18 21 18 18 22 21 19 ...
mysubset <- droplevels(subset(Data, freq > 20))
R> table(mysubset$var1)
C D
21 22
Hope this helps.
This is how I managed to do this.
I sorted the table of factors and associated counts.
studenttable <- sort(studenttable, decreasing=TRUE)
Now that it's in order, we can use index ranges sensibly. So I got the number of factor levels that are represented more than 1000 times in the data.
sum(studenttable>1000)
230
sum(studenttable<1000)
344
344+230=574
Now we know the first 230 factor levels are the ones we care about. So, we can do
idx <- names(studenttable[1:230])
bigdata <- data[data$Anon.Student.Id %in% idx,]
We can verify it worked by doing
bigstudenttable <- table(bigdata$Anon.Student.Id)
to get a printout and see that all the factor levels with fewer than 1000 instances are now 0.
