R: repeat elements of a data frame

I have searched the internet, but I haven't been able to find a solution to my problem.
I have a data frame of numbers and characters:
mydf <- data.frame(col1 = c(1, 2, 3, 4),
                   col2 = c(5, 6, 7, 8),
                   col3 = c("a", "b", "c", "d"),
                   stringsAsFactors = FALSE)
mydf:
col1 col2 col3
   1    5    a
   2    6    b
   3    7    c
   4    8    d
I would like to repeat this into
col1 col2 col3
   1    5    a
   1    5    a
   1    5    a
   2    6    b
   2    6    b
   2    6    b
   3    7    c
   3    7    c
   3    7    c
   4    8    d
   4    8    d
   4    8    d
Using apply(mydf, 2, function(x) rep(x, each = 3)) will give the right repetition, but will not conserve the classes of col1, col2, and col3, as numeric, numeric and character, respectively, as I would like. This is a constructed example, and setting the classes of each column in my data frame is a bit tedious.
Is there a way to make the repetition while conserving the classes?

It's even easier than you think.
index <- rep(seq_len(nrow(mydf)), each = 3)
mydf[index, ]
This also avoids the implicit looping from apply.
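A quick usage note: subsetting with duplicated row indices gives the result row names like 1.1, 1.2, and so on; if that bothers you, reset them afterwards. A small sketch:
index <- rep(seq_len(nrow(mydf)), each = 3)
out <- mydf[index, ]
rownames(out) <- NULL  # drop the 1.1, 1.2, ... suffixes
str(out)               # column classes are preserved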

This is an unfortunate and unexpected class conversion (to me, anyway). Here's an easy workaround that uses the fact that a data.frame is just a special list.
data.frame(lapply(mydf, function(x) rep(x, each = 3)))
(Does anyone know why the behaviour the questioner observed shouldn't be reported as a bug?)

Just another solution:
mydf3 <- do.call(rbind, rep(list(mydf), 3))
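Note that this stacks whole copies of mydf (rows 1-4, 1-4, 1-4) rather than repeating each row in place. For this example, where col1 happens to be sorted, a reorder restores the each-row layout:
mydf3 <- mydf3[order(mydf3$col1), ]  # rows 1,1,1,2,2,2,... for this example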

Take a look at aggregate and disaggregate in the raster package. Or, use my modified version zexpand below:
# zexpand: analogous to disaggregate
zexpand <- function(inarray, fact = 2, interp = FALSE, ...) {
  # same analysis of fact as disaggregate: allow one or two values, fact >= 1 required, etc.
  fact <- as.integer(round(fact))
  switch(as.character(length(fact)),
         '1' = xfact <- yfact <- fact,
         '2' = {xfact <- fact[1]; yfact <- fact[2]},
         {xfact <- fact[1]; yfact <- fact[2]
          warning('fact is too long. First two values used.')})
  if (xfact < 1) stop('fact[1] must be > 0')
  if (yfact < 1) stop('fact[2] must be > 0')
  # column expansion: each column of inarray repeated xfact times
  bigtmp <- matrix(rep(t(inarray), each = xfact),
                   nrow(inarray), ncol(inarray) * xfact, byrow = TRUE)
  # row expansion: each row of bigtmp repeated yfact times
  bigx <- t(matrix(rep(bigtmp, each = yfact),
                   ncol(bigtmp), nrow(bigtmp) * yfact, byrow = TRUE))
  # The interpolation would go here. Or use interp.loess on the output (won't
  # handle complex data). Also, look at fields::Tps, which probably does a much
  # better job anyway; just run it separately on the Re and Im parts.
  return(invisible(bigx))
}
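A minimal usage sketch (my example, not from the original answer). zexpand returns invisibly, so wrap the call in parentheses to print:
m <- matrix(1:4, nrow = 2)   # 2 x 2 input
(zexpand(m, fact = 2))       # 4 x 4 result: each cell of m becomes a 2 x 2 block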

I really like Richie Cotton's answer.
But you could also simply use rbind and reorder it.
res <- rbind(mydf, mydf, mydf)
res[order(res[, 1], res[, 2], res[, 3]), ]

The package mefa comes with a nice wrapper for rep applied to data.frame. This will match your example in one line:
mefa:::rep.data.frame(mydf, each=3)

Related

Between-list calculations per row in R

Let's say I have the following list of data frames (in reality I have many more):
seq <- c("12345", "67890")
li <- list()
for (i in 1:length(seq)) {
  li[[i]] <- data.frame(A = c(1, 2, 3),
                        B = c(2, 4, 6))
  names(li)[i] <- seq[i]
}
What I would like to do is calculate the mean for each cell position across the list elements, keeping the same number of rows and columns as in the original data frames. How could I do this? I believe I can use the apply() function, but I am unsure how.
The expected output (not surprising):
A B
1 1 2
2 2 4
3 3 6
In reality, the values within each list are not necessarily the same.
If there are no NAs, then we can Reduce to get the sum of observations for each element and divide by the length of the list:
Reduce(`+`, li)/length(li)
# A B
#1 1 2
#2 2 4
#3 3 6
If there are NA values, then it may be better to use mean (which has an na.rm argument). For this, we can convert the list to an array and then use apply:
apply(array(unlist(li), dim = c(dim(li[[1]]), length(li))), c(1, 2), mean)
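apply() returns a plain matrix here; a small sketch (assuming all elements of li share the same dimensions and column names) that also passes na.rm and restores the data-frame shape:
m <- apply(array(unlist(li), dim = c(dim(li[[1]]), length(li))),
           c(1, 2), mean, na.rm = TRUE)
out <- as.data.frame(m)
names(out) <- names(li[[1]])  # restore the original column names
out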
An equivalent option in tidyverse would be
library(tidyverse)
reduce(li, `+`)/length(li)

How do you test if a pair of elements is in a data frame?

Let's say I have this data frame A :
A = data.frame(first=c("a", "b","c", "d"), second=c(1, 2, 3, 4))
first second
1 a 1
2 b 2
3 c 3
4 d 4
And I have this data frame B :
B = data.frame(first=c("x", "a", "c"), second=c(1, 4, 3))
first second
1 x 1
2 a 4
3 c 3
I want to count the number of times a pair of the data frame B (B$first, B$second) is in the data frame A. The counting part is not the problem, I just can't find the function to determine whether a pair is in a data frame.
The result would be that only c("c", 3) is an element of A, so the count should be 1. Both "a" and 4 appear in data frame A, but the pair c("a", 4) does not, so I don't want to count it; I want exact matches only.
I'm looking for a function like %in% that could work for pairs.
Thanks for your help
Maybe something like this
apply(B, 1, function(r, A) { sum(A$first == r[1] & A$second == r[2]) }, A)
Basically, what it does is the following: for every row of B it applies a function that inspects which elements of A are in accordance with row r from B (part A$first==r[1] & A$second==r[2]) and then sums obtained logicals to derive the number of rows in A that are in accordance with row r.
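For the example data this gives, with the per-row counts kept in tmp:
tmp <- apply(B, 1, function(r, A) sum(A$first == r[1] & A$second == r[2]), A)
tmp       # 0 0 1 -- only the pair ("c", 3) occurs in A
sum(tmp)  # 1
(Note that apply() coerces the rows of B to character, which is harmless for this example.)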
If you also want grouping it can easily be done with dplyr like this
cbind(B, tmp) %>% group_by(first, second) %>% summarise(n = max(tmp))
where tmp is a variable holding the result of the aforementioned apply. (The %.% operator from early dplyr versions has since been replaced by %>%.)
Here's an alternative: rbind your data.frames together and use duplicated.
AB <- do.call(rbind, mget(c("A", "B")))
AB$ind <- as.numeric(duplicated(AB))
AB[grep("^B", rownames(AB)), ]
# first second ind
# B.1 x 1 0
# B.2 a 4 0
# B.3 c 3 1
You can also probably try to use "digest" to generate a hash for each row, but I'm not sure how efficient this would be:
library(digest)
Reduce(function(x, y) y %in% x,
       lapply(mget(c("A", "B")), function(x) apply(x, 1, digest)))
# [1] FALSE FALSE TRUE
An alternative is to merge by row, e.g. mB <- apply(B, 1, function(j) paste0(j[1], "_", j[2])) and similarly mA for A, at which point you can test mB %in% mA (no explicit loop needed, since %in% is vectorised).
Not that I would really recommend doing this :-)

Number of Unique Obs by Variable in a Data Table

I have read in a large data file into R using the following command
data <- as.data.set(spss.system.file(paste(path, file, sep = '/')))
The data set contains columns which should not belong, and contain only blanks. This issue has to do with R creating new variables based on the variable labels attached to the SPSS file (Source).
Unfortunately, I have not been able to determine the options necessary to resolve the problem. I have tried all of foreign::read.spss, memisc::spss.system.file, and Hmisc::spss.get, with no luck.
Instead, I would like to read in the entire data set (with ghost columns) and remove unnecessary variables manually. Since the ghost columns contain only blank spaces, I would like to remove any variables from my data.table where the number of unique observations is equal to one.
My data are large, so they are stored in data.table format. I would like to determine an easy way to check the number of unique observations in each column, and drop columns which contain only one unique observation.
require(data.table)
### Create a data.table
dt <- data.table(a = 1:10,
                 b = letters[1:10],
                 c = rep(1, times = 10))
### Create a comparable data.frame
df <- data.frame(dt)
### Unique values of one column
unique(dt$a)
### Number of unique values
length(unique(dt$a))
However, I wish to calculate the number of obs for a large data file, so referencing each column by name is not desired. I am not a fan of eval(parse()).
### Number of unique obs in each variable, for a large list of vars
lapply(names(df), function(x) {
  length(unique(df[, x]))
})
### Unexpected result
length(unique(dt[, 'a', with = FALSE]))  # Returns 1
It seems to me the problem is that
dt[, 'a', with = F]
returns an object of class "data.table". It makes sense that the length of this object is 1, since it is a data.table containing 1 variable. We know that data.frames are really just lists of variables, and so in this case the length of the list is just 1.
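A quick illustration of the difference:
length(dt[, 'a', with = FALSE])  # 1: a one-column data.table, i.e. a list of length 1
length(dt[['a']])                # 10: the underlying vector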
Here's pseudo-code for how I would handle it the data.frame way:
for (x in names(data)) {
  unique.obs <- length(unique(data[, x]))
  if (unique.obs == 1) {
    data[, x] <- NULL
  }
}
Any insight as to how I may more efficiently ask for the number of unique observations by column in a data.table would be much appreciated. Alternatively, if you can recommend how to drop observations if there is only one unique observation within a data.table would be even better.
Update: uniqueN
As of version 1.9.6, there is a built-in (optimized) version of this solution: the uniqueN function. Now it is as simple as:
dt[ , lapply(.SD, uniqueN)]
If you want to find the number of unique values in each column, something like
dt[, lapply(.SD, function(x) length(unique(x)))]
## a b c
## 1: 10 10 1
To get your function to work you need to use with=FALSE within [.data.table, or simply use [[ instead (read fortune(312) as well...)
lapply(names(df), function(x) nrow(unique(dt[, x, with = FALSE])))
or
lapply(names(df), function(x) length(unique(dt[[x]])))
will work
In one step
dt[, names(dt) := lapply(.SD, function(x) if (length(unique(x)) == 1) NULL else x)]
# or to avoid calling `.SD`
dt[, Filter(names(dt), f = function(x) length(unique(dt[[x]])) == 1) := NULL]
The approaches in the other answers are good. Another way to add to the mix, just for fun :
for (i in names(DT)) if (length(unique(DT[[i]]))==1) DT[,(i):=NULL]
or if there may be duplicate column names :
for (i in ncol(DT):1) if (length(unique(DT[[i]]))==1) DT[,(i):=NULL]
NB: (i) on the LHS of := is a trick to use the value of i rather than a column named "i".
Here is a solution to your core problem (I hope I got it right).
require(data.table)
### Create a data.table
dt <- data.table(a = 1:10,
                 b = letters[1:10],
                 d1 = "",
                 c = rep(1, times = 10),
                 d2 = "")
dt
     a b d1 c d2
 1:  1 a     1
 2:  2 b     1
 3:  3 c     1
 4:  4 d     1
 5:  5 e     1
 6:  6 f     1
 7:  7 g     1
 8:  8 h     1
 9:  9 i     1
10: 10 j     1
First, I introduce two columns d1 and d2 that have no values whatsoever. Those you want to delete, right? If so, I just identify those columns and select all other columns in the dt.
only_space <- function(x) {
  length(unique(x)) == 1 && x[1] == ""
}
bolCols <- apply(dt, 2, only_space)
dt[, (1:ncol(dt))[!bolCols], with = FALSE]
Somehow, I have the feeling that you could further simplify it...
Output:
     a b c
 1:  1 a 1
 2:  2 b 1
 3:  3 c 1
 4:  4 d 1
 5:  5 e 1
 6:  6 f 1
 7:  7 g 1
 8:  8 h 1
 9:  9 i 1
10: 10 j 1
There is an easy way to do that using the "dplyr" library and its select function, as follows:
library(dplyr)
newdata <- select(old_data, first_variable, second_variable)
(first_variable and second_variable stand in for your own column names; you can choose as many variables as you like.) Then you will get the data in the form you want.
Many thanks,
Fadhah
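For what it's worth, with a recent dplyr (1.0 or later, where tidyselect's where() helper is available), the drop-constant-columns step from the question can also be written directly; a sketch, not from the original answers:
library(dplyr)
# keep only columns with more than one distinct value
newdata <- old_data %>% select(where(~ n_distinct(.x) > 1))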

Fast melted data.table operations

I am looking for patterns for manipulating data.table objects whose structure resembles that of dataframes created with melt from the reshape2 package. I am dealing with data tables with millions of rows. Performance is critical.
The generalized form of the question is whether there is a way to perform grouping based on a subset of values in a column and have the result of the grouping operation create one or more new columns.
A specific form of the question could be how to use data.table to accomplish the equivalent of what dcast does in the following:
input <- data.table(
  id = c(1, 1, 1, 2, 2, 2, 3, 3, 3, 3),
  variable = c('x', 'y', 'y', 'x', 'y', 'y', 'x', 'x', 'y', 'other'),
  value = c(1, 2, 3, 4, 5, 6, 7, 8, 9, 10))
dcast(input,
      id ~ variable, sum,
      subset = .(variable %in% c('x', 'y')))
the output of which is
  id  x  y
1  1  1  5
2  2  4 11
3  3 15  9
Quick untested answer: seems like you're looking for by-without-by, a.k.a. grouping-by-i :
setkey(input,variable)
input[c("x","y"),sum(value)]
This is like a fast HAVING in SQL. j gets evaluated for each row of i. In other words, the above is the same result but much faster than :
input[,sum(value),keyby=variable][c("x","y")]
The latter subsets and evals for all the groups (wastefully) before selecting only the groups of interest. The former (by-without-by) goes straight to the subset of groups only.
The group results will be returned in long format, as always. But reshaping to wide afterwards on the (relatively small) aggregated data should be relatively instant. That's the thinking anyway.
The first setkey(input,variable) might bite if input has a lot of columns not of interest. If so, it might be worth subsetting the columns needed :
DT = setkey(input[ , c("variable","value")], variable)
DT[c("x","y"),sum(value)]
In future when secondary keys are implemented that would be easier :
set2key(input,variable) # add a secondary key
input[c("x","y"),sum(value),key=2] # syntax speculative
To group by id as well :
setkey(input,variable)
input[c("x","y"),sum(value),by='variable,id']
and including id in the key might be worth setkey's cost depending on your data :
setkey(input,variable,id)
input[c("x","y"),sum(value),by='variable,id']
If you combine a by-without-by with by, as above, then the by-without-by operates just like a subset; i.e., j is only run for each row of i when by is missing (hence the name by-without-by). So you need to include variable, again, in the by as shown above.
Alternatively, the following should group by id over the union of "x" and "y" instead (but the above is what you asked for in the question, iiuc) :
input[c("x","y"),sum(value),by=id]
> setkey(input, "id")
> input[ , list(sum(value)), by=id]
id V1
1: 1 6
2: 2 15
3: 3 34
> input[ variable %in% c("x", "y"), list(sum(value)), by=id]
id V1
1: 1 6
2: 2 15
3: 3 24
The last one:
> input[ variable %in% c("x", "y"), list(sum(value)), by=list(id, variable)]
id variable V1
1: 1 x 1
2: 1 y 5
3: 2 x 4
4: 2 y 11
5: 3 x 15
6: 3 y 9
I'm not sure if this is the best way, but you can try:
input[, list(x = sum(value[variable == "x"]),
             y = sum(value[variable == "y"])), by = "id"]
# id x y
# 1: 1 1 5
# 2: 2 4 11
# 3: 3 15 9
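Worth adding: newer versions of data.table (1.9.6 and later) ship their own optimized dcast method, so the cast from the question should run directly on the data.table without reshape2; a sketch, assuming a current data.table:
library(data.table)
dcast(input, id ~ variable, fun.aggregate = sum,
      subset = .(variable %in% c('x', 'y')))
#    id  x  y
# 1:  1  1  5
# 2:  2  4 11
# 3:  3 15  9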

Change the class from factor to numeric of many columns in a data frame

What is the quickest/best way to change a large number of columns to numeric from factor?
I used the following code but it appears to have re-ordered my data.
> head(stats[,1:2])
rk team
1 1 Washington Capitals*
2 2 San Jose Sharks*
3 3 Chicago Blackhawks*
4 4 Phoenix Coyotes*
5 5 New Jersey Devils*
6 6 Vancouver Canucks*
for (i in c(1, 3:ncol(stats))) {
  stats[, i] <- as.numeric(stats[, i])
}
> head(stats[,1:2])
rk team
1 2 Washington Capitals*
2 13 San Jose Sharks*
3 24 Chicago Blackhawks*
4 26 Phoenix Coyotes*
5 27 New Jersey Devils*
6 28 Vancouver Canucks*
What is the best way, short of naming every column as in:
df$colname <- as.numeric(df$colname)
You have to be careful when changing factors to numeric. Here is a line of code that changes a set of columns from factor to numeric. I am assuming here that the columns to be changed to numeric are 1, 3, 4 and 5 respectively; adjust accordingly.
cols <- c(1, 3, 4, 5)
df[, cols] <- apply(df[, cols], 2, function(x) as.numeric(as.character(x)))
Further to Ramnath's answer, the behaviour you are experiencing is due to as.numeric(x) returning the internal, numeric representation of the factor x at the R level. If you want to preserve the numbers that are the levels of the factor (rather than their internal representation), you need to convert to character via as.character() first, as per Ramnath's example.
Your for loop is just as reasonable as an apply call and might be slightly more readable as to what the intention of the code is. Just change this line:
stats[,i] <- as.numeric(stats[,i])
to read
stats[,i] <- as.numeric(as.character(stats[,i]))
This is FAQ 7.10 in the R FAQ.
HTH
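A tiny demonstration of the internal-codes trap described above:
f <- factor(c(10, 20, 30))
as.numeric(f)                # 1 2 3    -- the internal level codes
as.numeric(as.character(f))  # 10 20 30 -- the values you actually want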
This can be done in one line; there's no need for a loop, be it a for loop or an apply. Use unlist() instead:
# test data
Df <- data.frame(
  x = as.factor(sample(1:5, 30, replace = TRUE)),
  y = as.factor(sample(1:5, 30, replace = TRUE)),
  z = as.factor(sample(1:5, 30, replace = TRUE)),
  w = as.factor(sample(1:5, 30, replace = TRUE))
)
##
Df[,c("y","w")] <- as.numeric(as.character(unlist(Df[,c("y","w")])))
str(Df)
Edit: for your code, this becomes:
id <- c(1, 3:ncol(stats))
stats[, id] <- as.numeric(as.character(unlist(stats[, id])))
Obviously, if you have a one-column data frame and you don't want the automatic dimension reduction of R to convert it to a vector, you'll have to add the drop=FALSE argument.
I know this question is long resolved, but I recently had a similar issue and think I've found a slightly more elegant and functional solution, although it requires the magrittr package.
library(magrittr)
cols = c(1, 3, 4, 5)
df[,cols] %<>% lapply(function(x) as.numeric(as.character(x)))
The %<>% operator pipes and reassigns, which is very useful for keeping data cleaning and transformation simple. Now the list apply function is much easier to read, by only specifying the function you wish to apply.
Here are some dplyr options:
# by column type:
df %>%
  mutate_if(is.factor, ~ as.numeric(as.character(.)))
# by specific columns:
df %>%
  mutate_at(vars(x, y, z), ~ as.numeric(as.character(.)))
# all columns:
df %>%
  mutate_all(~ as.numeric(as.character(.)))
I think that ucfagls found why your loop is not working.
In case you still don't want to use a loop, here is a solution with lapply:
factorToNumeric <- function(f) as.numeric(levels(f))[as.integer(f)]
cols <- c(1, 3:ncol(stats))
stats[cols] <- lapply(stats[cols], factorToNumeric)
Edit: I found a simpler solution. It seems that as.matrix converts to character, so
stats[cols] <- as.numeric(as.matrix(stats[cols]))
should do what you want.
lapply is pretty much designed for this
unfactorize<-c("colA","colB")
df[, unfactorize] <- lapply(unfactorize, function(x) as.numeric(as.character(df[, x])))
I found this function in a couple of other duplicate threads and have found it an elegant and general way to solve this problem. This thread shows up first in most searches on this topic, so I am sharing it here to save folks some time. I take no credit for it; see the original posts here and here for details.
df <- data.frame(x = 1:10,
                 y = rep(1:2, 5),
                 k = rnorm(10, 5, 2),
                 z = rep(c(2010, 2012, 2011, 2010, 1999), 2),
                 j = c(rep(c("a", "b", "c"), 3), "d"))
convert.magic <- function(obj, type) {
  FUN1 <- switch(type,
                 character = as.character,
                 numeric = as.numeric,
                 factor = as.factor)
  out <- lapply(obj, FUN1)
  as.data.frame(out)
}
str(df)
str(convert.magic(df, "character"))
str(convert.magic(df, "factor"))
df[, c("x", "y")] <- convert.magic(df[, c("x", "y")], "factor")
I would like to point out that if you have NAs in any column, simply using subscripts will not work. If there are NAs in the factor, you must use the apply approach provided by Ramnath.
E.g.
Df <- data.frame(
  x = c(NA, as.factor(sample(1:5, 30, replace = TRUE))),
  y = c(NA, as.factor(sample(1:5, 30, replace = TRUE))),
  z = c(NA, as.factor(sample(1:5, 30, replace = TRUE))),
  w = c(NA, as.factor(sample(1:5, 30, replace = TRUE)))
)
Df[,c(1:4)] <- as.numeric(as.character(Df[,c(1:4)]))
Returns the following:
Warning message:
NAs introduced by coercion
> head(Df)
x y z w
1 NA NA NA NA
2 NA NA NA NA
3 NA NA NA NA
4 NA NA NA NA
5 NA NA NA NA
6 NA NA NA NA
But:
Df[, c(1:4)] <- apply(Df[, c(1:4)], 2, function(x) as.numeric(as.character(x)))
Returns:
> head(Df)
x y z w
1 NA NA NA NA
2 2 3 4 1
3 1 5 3 4
4 2 3 4 1
5 5 3 5 5
6 4 2 4 4
You can use the unfactor() function from the "varhandle" package from CRAN:
library("varhandle")
my_iris <- data.frame(Sepal.Length = factor(iris$Sepal.Length),
                      sample_id = factor(1:nrow(iris)))
my_iris <- unfactor(my_iris)
I like this code because it's pretty handy:
# convert every column to its best-fitting data type
data[] <- lapply(data, function(x) type.convert(as.character(x), as.is = TRUE))
It is not exactly what was asked for (convert to numeric), but in many cases even more appropriate.
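A quick sanity check of what type.convert does (a toy example of mine):
str(type.convert(c("1", "2", "3"), as.is = TRUE))  # int [1:3] 1 2 3
str(type.convert(c("a", "b"), as.is = TRUE))       # chr [1:2] "a" "b"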
Based on #SDahm's answer, this was an "optimal" solution for my tibble:
data %<>% lapply(type.convert) %>% as.data.table()
This requires magrittr and data.table.
I tried a bunch of these on a similar problem and kept getting NAs. Base R has some really irritating coercion behaviors, which are generally fixed in Tidyverse packages. I used to avoid them because I didn't want to create dependencies, but they make life so much easier that now I don't even bother trying to figure out the Base R solution most of the time.
Here's the Tidyverse solution, which is extremely simple and elegant:
library(purrr)
mydf <- data.frame(
  x1 = factor(c(3, 5, 4, 2, 1)),
  x2 = factor(c("A", "C", "B", "D", "E")),
  x3 = c(10, 8, 6, 4, 2))
map_df(mydf, as.numeric)
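One caveat: as.numeric applied directly to a factor returns the internal level codes (the FAQ 7.10 trap discussed above), so for factors whose levels are numbers stored as text, go through as.character first; a safer variant of the same call:
map_df(mydf, ~ as.numeric(as.character(.x)))
(Columns with non-numeric levels, like x2 here, become NA with a warning.)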
For changing just one column's type, I tried this and think it is better than many of the other versions if you are not going to change all the column types:
df$colname <- as.numeric(as.character(df$colname))
(going through as.character avoids the internal-codes trap discussed above), and
df$colname <- as.character(df$colname)
for the reverse.
I had problems converting all columns to numeric with an apply() call:
apply(data, 2, as.numeric)
The problem turns out to be because some of the strings had a comma in them -- e.g. "1,024.63" instead of "1024.63" -- and R does not like this way of formatting numbers. So I removed them and then ran as.numeric():
data <- as.data.frame(apply(data, 2, function(x) {
  y <- str_replace_all(x, ",", "")  # remove commas
  as.numeric(y)                     # then convert
}))
Note that this requires the stringr package to be loaded.
That's what worked for me. The apply() function tries to coerce df to a matrix and returns NAs, so iterate over the columns with sapply() instead:
numeric.df <- as.data.frame(sapply(df, as.numeric))
