I am trying to generate a lot of test data for other programs.
Working in RStudio, I use haven to import an SPSS .sav file, which has 73 variables with their values and labels recorded in it, as a data frame "td". This gives me all the variable names I need to work with. Then I delete all the existing data.
td <- td[0,]
Then I generate 10,000 test data rows by loading the index IDs
td$ID <- 12340000:12349999
So far so good.
I have a constant called ThismanyRows <- 10000
I have a large list of Column header names in a variable called BinaryVariables
And a vector of valid values for it called CheckedOrNot <- c(NA, 1)
This is where the problem is:
td[,BinaryVariables] <- sample(x = CheckedOrNot, size= ThismanyRows, replace = TRUE)
does fill all the columns with data. But it's exactly the same data in every column, which isn't what I want.
I want the sample function to run once for each column, not once for each value in each column.
Even when
Fillbinary <- function () {sample(x = CheckedOrNot, size= ThismanyRows, replace = TRUE)}
and
td <- lapply(td[,BinaryVariables],Fillbinary)
generates: Error in FUN(X[[i]], ...) : unused argument (X[[i]])
So far I have not been able to work out how to deal with each column as a column and apply the sample function to it.
Any help much appreciated.
Let's generate some fake data first for the example:
BinaryVariables <- c("v1","v2","v3")
CheckedOrNot <- c(NA, 1)
ThismanyRows <- 10
td <- data.frame(ID=1:10)
The issue is that you are generating 10 values and using them to replace 3 * 10 values, so the same 10 values are recycled into every column.
There are a couple of ways to solve this. You might initially think, well, I'll generate 10 values 3 times, like so:
td[BinaryVariables] <- replicate(length(BinaryVariables),
sample(x = CheckedOrNot, size=ThismanyRows, replace=TRUE),
simplify=FALSE)
That will work fine, but why sample 3 times if you can sample once and fill once?
td[BinaryVariables] <- sample(x = CheckedOrNot,
size=ThismanyRows*length(BinaryVariables), replace = TRUE)
And the (well, a) result shows that the values in each column are different:
#   ID v1 v2 v3
#1   1 NA  1  1
#2   2 NA  1  1
#3   3 NA  1 NA
#4   4 NA  1 NA
#5   5  1 NA  1
#6   6 NA  1  1
#7   7  1 NA  1
#8   8  1  1 NA
#9   9  1 NA NA
#10 10  1 NA NA
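For completeness, the lapply attempt from the question fails only because Fillbinary takes no arguments, while lapply passes each element to the function it calls. A minimal sketch of a per-column version follows, reusing the names from the question on the small example data. The seed and the argument name col_name are arbitrary choices made for this illustration.

```r
set.seed(42)  # arbitrary seed, only so the run is reproducible
BinaryVariables <- c("v1", "v2", "v3")
CheckedOrNot <- c(NA, 1)
ThismanyRows <- 10
td <- data.frame(ID = 1:10)

# The function must accept (and may ignore) the element lapply hands it
Fillbinary <- function(col_name) {
  sample(x = CheckedOrNot, size = ThismanyRows, replace = TRUE)
}

# lapply returns one independently sampled vector per column name;
# assigning the list to td[BinaryVariables] creates or overwrites those columns
td[BinaryVariables] <- lapply(BinaryVariables, Fillbinary)
str(td)
```

Note that the result is assigned to td[BinaryVariables] rather than to td itself; assigning to td would replace the whole data frame with a plain list.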
I want to create multiple columns in a dataframe that each calculate a different value based on values from an existing column.
Say I have the following dataframe:
date <- c('1','2','3','4','5')
close <- c('10','20','15','13','19')
test_df <- data.frame(date,close)
I want to create a new column that does the following operation with dplyr:
test_df %>%
mutate(logret = log(close / lag(close, n=1)))
However I would like to create a new column for multiple values of n such that I have columns:
logret1 for n=1,
logret2 for n=2,
logret3 for n=3
etc...
I've used the function seq(from=1, to=5, by=1) as an example to get a vector of numbers to replace n with. I've tried to create a for loop around the mutate function:
seq2 <- seq(from=1, to=5, by=1)
for (number in seq2){
new_df <- test_df %>%
mutate(logret = log(close/lag(close, n=seq2)))
}
However I get the error:
Error: Problem with `mutate()` input `logret`. x `n` must be a nonnegative integer scalar, not a double vector of length 5. i Input `logret` is `log(close2/lag(close2, n = seq2))`.
I realise I can't pass in a vector for n, however I am stuck on how to proceed.
Any help would be much appreciated, Thanks.
You can use purrr's map_dfc to add new columns :
library(dplyr)
library(purrr)
n <- 3
bind_cols(test_df, map_dfc(1:n, ~test_df %>%
transmute(!!paste0('logret', .x) := log(close / lag(close, n=.x)))))
# date close logret1 logret2 logret3
#1 1 10 NA NA NA
#2 2 20 0.6931472 NA NA
#3 3 15 -0.2876821 0.4054651 NA
#4 4 13 -0.1431008 -0.4307829 0.26236426
#5 5 19 0.3794896 0.2363888 -0.05129329
data
test_df <- data.frame(date,close)
# convert the character columns to numeric; as.is = TRUE avoids creating factors
test_df <- type.convert(test_df, as.is = TRUE)
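The loop from the question can also be made to work by using the loop variable instead of the whole vector and by building each column name inside the loop. Here is a minimal base R sketch. lag_vec is a hypothetical helper written for this example so that it does not depend on dplyr; close is assumed to be numeric and n is assumed to be smaller than the number of rows.

```r
test_df <- data.frame(date = 1:5, close = c(10, 20, 15, 13, 19))

# Hypothetical helper: shift x down by n positions, padding the start with NA
lag_vec <- function(x, n) c(rep(NA, n), head(x, -n))

for (n in 1:3) {
  # Build the column name (logret1, logret2, ...) and fill that column
  test_df[[paste0("logret", n)]] <- log(test_df$close / lag_vec(test_df$close, n))
}
test_df
```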
You can use data.table, an R package that provides an enhanced version of data.frame. This is a good resource for getting started: https://www.machinelearningplus.com/data-manipulation/datatable-in-r-complete-guide/
library(data.table)
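# Create the data.table (close must be numeric for log() to work)
test_dt <- data.table(date, close = as.numeric(close))
# Define the new column names
logret_cols <- paste0('logret', 1:3)
# Create the new columns; shift() is data.table's own lag function,
# so this does not depend on dplyr::lag being loaded
test_dt[, (logret_cols) := lapply(1:3, function(n) log(close / shift(close, n = n)))]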
test_dt
# date close logret1 logret2 logret3
#1: 1 10 NA NA NA
#2: 2 20 0.6931472 NA NA
#3: 3 15 -0.2876821 0.4054651 NA
#4: 4 13 -0.1431008 -0.4307829 0.26236426
#5: 5 19 0.3794896 0.2363888 -0.05129329
data.table handles memory efficiently. If you will be dealing with large amounts of data, take a look at these benchmarks:
https://h2oai.github.io/db-benchmark/
EDIT
You can even do it with a mix of data.table and purrr. Here's an example using the function purrr::map()
# requires library(purrr); shift() is data.table's lag function
test_dt[, (logret_cols) := map(1:3, ~log(close / shift(close, n = .x)))]
test_dt
# date close logret1 logret2 logret3
#1: 1 10 NA NA NA
#2: 2 20 0.6931472 NA NA
#3: 3 15 -0.2876821 0.4054651 NA
#4: 4 13 -0.1431008 -0.4307829 0.26236426
#5: 5 19 0.3794896 0.2363888 -0.05129329
I have a vector of variable names and several single-column matrices with row names.
I want to create a new matrix by matching/merging the row names of the single-column matrices.
Example:
A vector of variable names
Complete_names <- c("D","C","A","B")
Several single-column matrices
Matrix_1 <- matrix(c(1,2,3),3,1)
rownames(Matrix_1) <- c("D","C","B")
Matrix_2 <- matrix(c(4,5,6),3,1)
rownames(Matrix_2) <- c("A","B","C")
Desired output:
Desired_output <- matrix(c(1,2,NA,3,NA,6,4,5),4,2)
rownames(Desired_output) <- c("D","C","A","B")
[,1] [,2]
D 1 NA
C 2 6
A NA 4
B 3 5
I know there are several similar postings like this, but those previous answers do not work perfectly for this one.
The main job can be done with merge, returning a data frame:
merge(Matrix_1, Matrix_2, by = "row.names", all = TRUE)
# Row.names V1.x V1.y
# 1 A NA 4
# 2 B 3 5
# 3 C 2 6
# 4 D 1 NA
Depending on your purposes you may then further modify names or get rid of Row.names.
The answers offered by Julius Vainora and achimneyswallow work well, but just to exactly obtain the desired output I want:
temp <- merge(Matrix_1, Matrix_2, by = "row.names", all = TRUE)
temp$Row.names <- factor(temp$Row.names, levels=Complete_names)
temp <- temp[order(temp$Row.names),]
rownames(temp) <- temp[,1]
Desired_output <- as.matrix(temp[,-1])
V1.x V1.y
D 1 NA
C 2 6
A NA 4
B 3 5
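As a sketch of an alternative that stays in matrix form and keeps the order of Complete_names directly, match() can look up each name in the row names of every matrix. This is only an illustration on the example data (the name aligned is made up here), and it assumes each matrix has a single column and unique row names.

```r
Complete_names <- c("D", "C", "A", "B")
Matrix_1 <- matrix(c(1, 2, 3), 3, 1, dimnames = list(c("D", "C", "B"), NULL))
Matrix_2 <- matrix(c(4, 5, 6), 3, 1, dimnames = list(c("A", "B", "C"), NULL))

# match() gives, for each name, its row position in the matrix (NA if absent);
# indexing a row with NA yields NA, which fills the gaps
aligned <- sapply(list(Matrix_1, Matrix_2), function(m) {
  m[match(Complete_names, rownames(m)), 1]
})
rownames(aligned) <- Complete_names
aligned
```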
I am trying to see if the amount of information that I have about a case is correlated to the duration of the user.
Currently, I have a dataframe, df, and I attempted to do the following:
df["amount_known"] <-df[rowSums(!is.na(df)),]
This resulted in the following error:
Error in [<-.data.frame(*tmp*, "amount_known", value = list(status = c(3L, :
replacement element 1 has 808047 rows, need 808247
What could cause this to happen (and of course, how do I fix it)?
If you want the number of non-NA entries in a new column amount_known in df you can do it like this:
df$amount_known <-rowSums(!is.na(df))
Here's a small example of what is happening:
df <- data.frame(x = 1:3, y = 66:68)
df$y[1] <- NA
df$x[3] <- NA
df
# x y
#1 1 NA
#2 2 67
#3 NA 68
rowSums(!is.na(df))
#[1] 1 2 1
This results in a vector with the number of non-NAs in each row of df.
Now, if you do
df[rowSums(!is.na(df)),]
This will select the rows in the vector c(1,2,1) from df:
# x y
#1 1 NA
#2 2 67
#1.1 1 NA
So for example, row 1 is shown twice.
And in your code, you were then assigning that output to a new column in df.
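One caveat worth keeping in mind: if the assignment is run a second time, the amount_known column itself counts as a non-NA entry in every row. A minimal sketch that guards against this by excluding that column from the count (info_cols is a name made up for this example):

```r
df <- data.frame(x = c(1, 2, NA), y = c(NA, 67, 68))

# Count non-NA entries per row, ignoring amount_known if it already exists
info_cols <- setdiff(names(df), "amount_known")
df$amount_known <- rowSums(!is.na(df[info_cols]))

# Running the same computation again leaves the counts unchanged
info_cols <- setdiff(names(df), "amount_known")
df$amount_known <- rowSums(!is.na(df[info_cols]))
df
```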
Given
index = c(1,2,3,4,5)
codes = c("c1","c1,c2","","c3,c1","c2")
df=data.frame(index,codes)
df
index codes
1 1 c1
2 2 c1,c2
3 3
4 4 c3,c1
5 5 c2
How can I create a new df that looks like
df1
index codes
1 1 c1
2 2 c1
3 2 c2
4 3
5 4 c3
6 4 c1
7 5 c2
so that I can perform aggregates on the codes? The "index" of the actual data set are a series of timestamps, so I'll want to aggregate by day or hour.
Roland's method is quite good, provided the variable index has unique keys. You can gain some speed by working with the lists directly. Take into account that:
In your original data frame, codes is a factor. There is no point in that; you want it to be character.
In your original data frame, "" is used instead of NA. As the length of that one is 0, you can get into all kinds of trouble later on. I'd use NA there. " " is an actual value, "" is no value at all, but you want a missing value. Hence NA.
So my idea would be:
The data:
index = c(1,2,3,4,5)
codes = c("c1","c1,c2",NA,"c3,c1","c2")
df=data.frame(index,codes,stringsAsFactors=FALSE)
Then :
X <- strsplit(df$codes,",")
data.frame(
index = rep(df$index,sapply(X,length)),
codes = unlist(X)
)
Or, if you insist on using "" instead of NA:
X <- strsplit(df$codes,",")
ll <- sapply(X,length)
X[ll==0] <- NA
data.frame(
index = rep(df$index,pmax(1,ll)),
codes = unlist(X)
)
Neither method assumes a unique key in index. Both work perfectly well with non-unique timestamps.
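The two variants above can be folded into one small function. The following is a sketch only: split_codes is a hypothetical helper name, and it treats both "" and NA as a single missing code.

```r
# Hypothetical helper wrapping the steps above; "" and NA both yield one NA row
split_codes <- function(index, codes, sep = ",") {
  parts <- strsplit(as.character(codes), sep, fixed = TRUE)
  n <- lengths(parts)
  parts[n == 0] <- NA_character_  # "" splits to character(0)
  data.frame(index = rep(index, pmax(1, n)),
             codes = unlist(parts),
             stringsAsFactors = FALSE)
}

long <- split_codes(c(1, 2, 3, 4, 5), c("c1", "c1,c2", "", "c3,c1", "c2"))
long

# Aggregating is then straightforward, for example counting each code
table(long$codes)
```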
You need to split the string (using strsplit) and then combine the resulting list with the data.frame.
The following relies on the assumption that codes are unique in each row. If you have many codes in some rows and only few in others, this might waste a lot of RAM and it might be better to loop.
#to avoid character(0), which would be omitted in rbind
levels(df$codes)[levels(df$codes)==""] <- " "
#rbind fills each row by propagating the values to the "empty" columns for each row
df2 <- cbind(df, do.call(rbind,strsplit(as.character(df$codes),",")))[,-2]
library(reshape2)
df2 <- melt(df2, id="index")[-2]
#here the assumption is needed
df2 <- df2[!duplicated(df2),]
df2[order(df2[,1], df2[,2]),]
# index value
#1 1 c1
#2 2 c1
#7 2 c2
#3 3
#9 4 c1
#4 4 c3
#5 5 c2
Here's another alternative using "data.table". The sample data includes NA instead of a blank space and includes duplicated index values:
index = c(1,2,3,2,4,5)
codes = c("c1","c1,c2",NA,"c3,c1","c2","c3")
df = data.frame(index,codes,stringsAsFactors=FALSE)
library(data.table)
## We could create the data.table directly, but I'm
## assuming you already have a data.frame ready to work with
DT <- data.table(df)
DT[, list(codes = unlist(strsplit(codes, ","))), by = "index"]
# index codes
# 1: 1 c1
# 2: 2 c1
# 3: 2 c2
# 4: 2 c3
# 5: 2 c1
# 6: 3 NA
# 7: 4 c2
# 8: 5 c3
I have an aggregation problem which I cannot figure out how to perform efficiently in R.
Say I have the following data:
group1 <- c("a","b","a","a","b","c","c","c","c",
"c","a","a","a","b","b","b","b")
group2 <- c(1,2,3,4,1,3,5,6,5,4,1,2,3,4,3,2,1)
value <- c("apple","pear","orange","apple",
"banana","durian","lemon","lime",
"raspberry","durian","peach","nectarine",
"banana","lemon","guava","blackberry","grape")
df <- data.frame(group1,group2,value)
I am interested in sampling from the data frame df such that I randomly pick only a single row from each combination of factors group1 and group2.
As you can see, the results of table(df$group1,df$group2)
1 2 3 4 5 6
a 2 1 2 1 0 0
b 2 2 1 1 0 0
c 0 0 1 1 2 1
shows that some combinations are seen more than once, while others are never seen. For those that are seen more than once (e.g., group1="a" and group2=3), I want to randomly pick only one of the corresponding rows and return a new data frame that has only that subset of rows. That way, each possible combination of the grouping factors is represented by only a single row in the data frame.
One important aspect here is that my actual data sets can contain anywhere from 500,000 rows to >2,000,000 rows, so it is important to be mindful of performance.
I am relatively new at R, so I have been having trouble figuring out how to generate this structure correctly. One attempt looked like this (using the plyr package):
choice <- function(x,label) {
cbind(x[sample(1:nrow(x),1),],data.frame(state=label))
}
df <- ddply(df[,c("group1","group2","value")],
.(group1,group2),
choice,
label="test")
Note that in this case, I am also adding an extra column called "state" to the data frame, whose value is passed through the extra label argument of the ddply call. However, I killed this after about 20 min.
In other cases, I have tried using aggregate or by or tapply, but I never know exactly what the specified function is getting, what it should return, or what to do with the result (especially for by).
I am trying to switch from python to R for exploratory data analysis, but this type of aggregation is crucial for me. In python, I can perform these operations very rapidly, but it is inconvenient as I have to generate a separate script/data structure for each different type of aggregation I want to perform.
I want to love R, so please help! Thanks!
Uri
Here is the plyr solution
set.seed(1234)
ddply(df, .(group1, group2), summarize,
value = value[sample(length(value), 1)])
This gives us
group1 group2 value
1 a 1 apple
2 a 2 nectarine
3 a 3 banana
4 a 4 apple
5 b 1 grape
6 b 2 blackberry
7 b 3 guava
8 b 4 lemon
9 c 3 durian
10 c 4 durian
11 c 5 raspberry
12 c 6 lime
EDIT. With a data frame that big, you are better off using data.table
library(data.table)
dt = data.table(df)
dt[,list(value = value[sample(length(value), 1)]),'group1, group2']
EDIT 2: Performance Comparison: Data Table is ~ 15 X faster
group1 = sample(letters, 1000000, replace = T)
group2 = sample(LETTERS, 1000000, replace = T)
value = runif(1000000, 0, 1)
df = data.frame(group1, group2, value)
dt = data.table(df)
f1_dtab = function() {
dt[,list(value = value[sample(length(value), 1)]),'group1, group2']
}
f2_plyr = function() {ddply(df, .(group1, group2), summarize, value =
value[sample(length(value), 1)])
}
f3_by = function() {do.call(rbind,by(df,list(grp1 = df$group1,grp2 = df$group2),
FUN = function(x){x[sample(nrow(x),1),]}))
}
library(rbenchmark)
benchmark(f1_dtab(), f2_plyr(), f3_by(), replications = 10)
test replications elapsed relative
f1_dtab() 10 4.764 1.00000
f2_plyr() 10 68.261 14.32851
f3_by() 10 67.369 14.14127
One more way:
with(df, tapply(value, list( group1, group2), length))
1 2 3 4 5 6
a 2 1 2 1 NA NA
b 2 2 1 1 NA NA
c NA NA 1 1 2 1
# Now use tapply to sample within groups
# `resample` fn is from the sample help page:
# Avoids an error with sample when only one value in a group.
resample <- function(x, ...) x[sample.int(length(x), ...)]
#Create a row index
df$idx <- 1:NROW(df)
rowidxs <- with(df, unique( c( # the `c` function will make a matrix into a vector
tapply(idx, list( group1, group2),
function (x) resample(x, 1) ))))
rowidxs
# [1] 1 5 NA 12 16 NA 3 15 6 4 14 10 NA NA 7 NA NA 8
df[rowidxs[!is.na(rowidxs)] , ]
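Another base R option, offered here as a sketch rather than a benchmarked solution, is to shuffle the rows once and then keep the first row seen for each combination. Because the order is random, the first row of each group is a random pick from that group. The seed and the names shuffled and picked are arbitrary choices for this example.

```r
set.seed(1234)  # arbitrary seed, only so the run is reproducible
group1 <- c("a","b","a","a","b","c","c","c","c",
            "c","a","a","a","b","b","b","b")
group2 <- c(1,2,3,4,1,3,5,6,5,4,1,2,3,4,3,2,1)
value <- c("apple","pear","orange","apple",
           "banana","durian","lemon","lime",
           "raspberry","durian","peach","nectarine",
           "banana","lemon","guava","blackberry","grape")
df <- data.frame(group1, group2, value)

# Shuffle all rows once
shuffled <- df[sample(nrow(df)), ]
# Keep only the first occurrence of each (group1, group2) combination
picked <- shuffled[!duplicated(shuffled[c("group1", "group2")]), ]
# Optional: restore a readable order
picked <- picked[order(picked$group1, picked$group2), ]
picked
```

This avoids calling a function once per group, which is usually where the time goes with many small groups.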