I would like to read a large .csv into R. It'd be handy to split it into various objects and treat them separately. I managed to do this with a while loop, assigning each tenth to an object:
# The dataset is larger, numbers are fictitious
n <- 0
while(n < 10000){
a <- paste('a_', n, sep = '')
assign(a, read.csv('df.csv',
header = F, stringsAsFactors = F, nrows = 1000, skip = 0 + n))
# There will be some additional processing here (omitted)
n <- n + 1000
}
Is there a more R-like way of doing this? I immediately thought of lapply. According to my understanding each object would be the element of a list that I would then have to unlist.
I gave a shot to the following but it didn't work and my list only has one element:
A <- lapply('df.csv', read.csv,
header = F, stringsAsFactors = F, nrows = 1000, skip = seq(0, 10000, 1000))
What am I missing? How do I proceed from here? How do I then unlist A and specify each element of the list as a separate data.frame?
If you apply lapply to a single element you'll have only one element as an output.
You probably want to do this:
skips <- seq(0, 9000, 1000) # starting row of each chunk
A <- lapply(skips, function(n){
read.csv('df.csv', header = F, stringsAsFactors = F, nrows = 1000, skip = n)
})
names(A) <- paste0('a_', skips) # same names as in your loop
For each element of skips, called n because that's the name I chose for my function parameter, I execute your command. A will be a list of the results.
Edit: As @Val mentions in the comments, assign is not needed here, so I removed it. If all works fine, you'll end up with a list of data.frames read from your csv.
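To try the chunked read end to end, here is a minimal sketch. It assumes a small temporary file as a stand-in for df.csv; the file contents, the 100-row size and the chunk size of 10 are made up for illustration. It also shows that the list elements can be used directly by name or position, so unlist is not required.

```r
# Hypothetical stand-in for 'df.csv': 100 rows, no header, in a temp file
csv_path <- tempfile(fileext = ".csv")
write.table(data.frame(id = 1:100, value = (1:100) * 2), csv_path,
            sep = ",", row.names = FALSE, col.names = FALSE)

chunk_size <- 10
skips <- seq(0, 90, by = chunk_size)  # one starting offset per chunk

# One read.csv call per offset; the result is a list of data.frames
A <- lapply(skips, function(n) {
  read.csv(csv_path, header = FALSE, stringsAsFactors = FALSE,
           nrows = chunk_size, skip = n)
})
names(A) <- paste0("a_", skips)

# Elements are reached by name or by position
first_chunk <- A[["a_0"]]
third_chunk <- A[[3]]
```

Keeping the chunks in one list usually makes the later processing easier, since the same lapply pattern can be applied to A again.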
I just started using R and I have 5 files of data (each file has only one column) with 227 observations in total. I have to find E(X) and E(X^2). I found E(X) by summing up all the values and dividing by 227. I also need to find E(X^2), but I don't know how to loop through the 5 files, get each individual value and square it.
I have code for loading the files and that is my code for finding the mean:
mydataset1 = read_csv("file1.txt", col_names = FALSE)
mydataset2 = read_csv("file2.txt", col_names = FALSE)
mydataset3 = read_csv("file3.txt", col_names = FALSE)
mydataset4 = read_csv("file4.txt", col_names = FALSE)
mydataset5 = read_csv("file5.txt", col_names = FALSE)
sum1 <- sum(mydataset1)
sum2 <- sum(mydataset2)
sum3 <- sum(mydataset3)
sum4 <- sum(mydataset4)
sum5 <- sum(mydataset5)
sumAll <- sum1 + sum2 + sum3 + sum4 + sum5
mean <- sumAll / 227
We can get all the datasets into a list with mget, based on the pattern of the object names, get the sum of each list element into a vector with sapply, and then divide the sum of that vector by 227:
sum(sapply(mget(ls(pattern = '^mydataset\\d+$')), sum))/227
You can simply square the variable
mydataset1 = read_csv("file1.txt", col_names = FALSE)
mydataset2 = read_csv("file2.txt", col_names = FALSE)
mydataset3 = read_csv("file3.txt", col_names = FALSE)
mydataset4 = read_csv("file4.txt", col_names = FALSE)
mydataset5 = read_csv("file5.txt", col_names = FALSE)
sum1 <- sum(mydataset1 ^ 2)
sum2 <- sum(mydataset2 ^ 2)
sum3 <- sum(mydataset3 ^ 2)
sum4 <- sum(mydataset4 ^ 2)
sum5 <- sum(mydataset5 ^ 2)
The rest of your code will be the same
Maybe you can try a base R code like below
sum(unlist(mget(ls(pattern = "mydataset\\d+"))))/227
Mathematically, sumAll is the sum of all data from mydataset1 to mydataset5. In this sense, you can gather the values via unlist, sum them up, and then divide by 227.
@Hugo actually answers the simple question of how to square a variable in R and then do an operation on it. I think we all assume you don't really want to create a new variable that is X1 squared (but you could do that if you wanted).
I'm going to suggest maybe a more beginner solution than some of the above if what you are doing is trying to learn the basics of R.
mydataset1 = read_csv("file1.txt", col_names = FALSE)
mydataset2 = read_csv("file2.txt", col_names = FALSE)
mydataset3 = read_csv("file3.txt", col_names = FALSE)
mydataset4 = read_csv("file4.txt", col_names = FALSE)
mydataset5 = read_csv("file5.txt", col_names = FALSE)
combined <- rbind(mydataset1, mydataset2, mydataset3, mydataset4, mydataset5)
sum(combined$X1)/nrow(combined)
sum(combined$X1^2)/nrow(combined)
In this solution you are still reading the individual files and typing out their names; as shown in other answers, there are lots of neat ways to do that automatically. This approach will always work, though.
Here what I'm doing is combining the data frames/ tibbles using the base rbind() function. It does what it sounds like, binds the data frames together.
Then I am doing the calculation, but instead of assuming that I know the number of rows, I'm getting it from the data. (If you have missing data that becomes a little more tricky but you will learn that soon enough.)
Also note that I am specifying the actual variable that you want. That is so you have a model for a situation in the future when you have multiple variables in your data frame.
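The file reading can also be looped rather than typed out five times. The sketch below is only an illustration: it writes five small temporary files (made-up values standing in for file1.txt to file5.txt) and uses base read.csv instead of readr's read_csv, so the column is named V1 rather than X1.

```r
# Hypothetical stand-ins for file1.txt ... file5.txt: one column, no header
values <- split(1:20, rep(1:5, each = 4))
paths <- vapply(values, function(v) {
  p <- tempfile(fileext = ".txt")
  writeLines(as.character(v), p)
  p
}, character(1))

# Read every file in one pass and stack the results
datasets <- lapply(paths, read.csv, header = FALSE)
combined <- do.call(rbind, datasets)

ex  <- mean(combined$V1)    # E(X)
ex2 <- mean(combined$V1^2)  # E(X^2)
```

With real files, paths could be a character vector of the five file names; the rest stays the same.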
I have a data frame which looks like this
value <- c(1:1000)
group <- c(1:5)
df <- data.frame(value,group)
And I want to use this function on my data frame
myfun <- function(){
wz1 <- df[sample(nrow(df), size = 300, replace = FALSE),]
wz2 <- df[sample(nrow(df), size = 10, replace = FALSE),]
wz3 <- df[sample(nrow(df), size = 100, replace = FALSE),]
wz4 <- df[sample(nrow(df), size = 40, replace = FALSE),]
wz5 <- df[sample(nrow(df), size = 50, replace = FALSE),]
wza <- rbind(wz1,wz2, wz3, wz4, wz5)
wza_sum <- aggregate(wza, by = list(group_ID=wza$group), FUN = sum)
return(list(wza = wza,wza_sum = wza_sum))
}
Right now I am returning one list which includes wza and wza_sum.
Is there a way to return two separate list in which one contains wza and the other list contains wza_sum?
The aggregate() function needs to be in myfun() because I want to replicate myfun() 100 times using
dfx <- replicate(100, myfun(), simplify = FALSE)
A function should take one input (or set of inputs), and return only one output (or a set of outputs). Consider the simple example of
myfunction <- function(x) {
x
x ** 2
}
Unless you are calling return() early (which you usually don't), the last object is returned. In fact, if you try to return two objects, e.g. return(1,2) you are met with
Error in return(1, 2) : multi-argument returns are not permitted
That is why the solution proposed by @StupidWolf in the comments is the most appropriate one, where you use return(list(wza = list(wza),wza_sum = list(wza_sum))). You then have to perform the necessary post-processing of splitting the lists if appropriate.
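One possible way to do that post-processing is sketched below. It assumes a simplified stand-in for myfun() (a single sample of 50 rows instead of five samples) that returns a plain two-element list, and then pulls each component out of the replicate() result with lapply. The names wza_list and wza_sum_list are made up for this example.

```r
set.seed(42)
df <- data.frame(value = 1:1000, group = 1:5)

# Simplified stand-in for myfun(): one sample instead of five
myfun <- function() {
  wza <- df[sample(nrow(df), size = 50, replace = FALSE), ]
  wza_sum <- aggregate(wza, by = list(group_ID = wza$group), FUN = sum)
  list(wza = wza, wza_sum = wza_sum)
}

dfx <- replicate(100, myfun(), simplify = FALSE)

# Pull the same-named component out of every run
wza_list     <- lapply(dfx, `[[`, "wza")
wza_sum_list <- lapply(dfx, `[[`, "wza_sum")
```

The function still returns a single object; the two separate lists are built afterwards from the 100 results.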
I am simulating dice throws, and would like to save the output in a single object, but cannot find a way to do so. I tried looking here, here, and here, but they do not seem to answer my question.
Here is my attempt to assign the result of a 20 x 3 trial to an object:
set.seed(1)
Twenty = for(i in 1:20){
trials = sample.int(6, 3, replace = TRUE)
print(trials)
i = i+1
}
print(Twenty)
What I do not understand is why I cannot recall the function after it is run?
I also tried using return instead of print in the function:
Twenty = for(i in 1:20){
trials = sample.int(6, 3, replace = TRUE)
return(trials)
i = i+1
}
print(Twenty)
or creating an empty matrix first:
mat = matrix(0, nrow = 20, ncol = 3)
mat
for(i in 1:20){
mat[i] = sample.int(6, 3, replace = TRUE)
print(mat)
i = i+1
}
but they seem to be worse (as I do not even get to see the trials).
Thanks for any hints.
There are several things wrong with your attempts:
1) A loop is not a function nor an object in R, so it doesn't make sense to assign a loop to a variable
2) When you have a loop for(i in 1:20), the loop will increment i so it doesn't make sense to add i = i + 1.
Your last attempt implemented correctly would look like this:
mat <- matrix(0, nrow = 20, ncol = 3)
for(i in 1:20){
mat[i, ] = sample.int(6, 3, replace = TRUE)
}
print(mat)
I personally would simply do
matrix(sample.int(6, 20 * 3, replace = TRUE), nrow = 20)
(since all draws are independent and with replacement, it doesn't matter if you make 3 draws 20 times or simply 60 draws)
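The two points above can be checked directly. In this small sketch (loop_value is a name made up for the example), the value of a for loop is captured to show that it is NULL, and the matrix is then filled one row per iteration.

```r
set.seed(1)

# The value of a for loop is NULL, so there is nothing useful to assign
loop_value <- for (i in 1:3) sample.int(6, 3, replace = TRUE)
is.null(loop_value)  # TRUE

# Row-wise fill: each iteration writes one trial into row i
mat <- matrix(0, nrow = 20, ncol = 3)
for (i in 1:20) {
  mat[i, ] <- sample.int(6, 3, replace = TRUE)
}
dim(mat)  # 20 3
```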
Usually, in most programming languages one does not assign a for loop to an object, as loops are not functions and do not return a value. One uses loops to iterate over existing objects. However, R maintains the apply family, which saves iterative outputs to objects of the same length as the inputs.
Consider lapply (list apply) for list output or sapply (simplified apply) for matrix output:
# LIST OUTPUT
Twenty <- lapply(1:20, function(x) sample.int(6, 3, replace = TRUE))
# MATRIX OUTPUT
Twenty <- sapply(1:20, function(x) sample.int(6, 3, replace = TRUE))
And to see your trials, simply print out the object
print(Twenty)
But since you never use the iterator variable, x, consider replicate (wrapper to sapply which by one argument can output a matrix or a list) that receives size and expression (no sequence inputs or functions) arguments:
# MATRIX OUTPUT (DEFAULT)
Twenty <- replicate(20, sample.int(6, 3, replace = TRUE))
# LIST OUTPUT
Twenty <- replicate(20, sample.int(6, 3, replace = TRUE), simplify = FALSE)
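One detail worth checking with the matrix output: sapply and replicate store each trial as a column, so the result is 3 x 20 rather than 20 x 3. A minimal sketch, where Twenty_rows and Twenty_list are names made up for the example:

```r
set.seed(1)

# Matrix output: one column per trial
Twenty <- replicate(20, sample.int(6, 3, replace = TRUE))
dim(Twenty)  # 3 20

# Transpose to get one row per trial
Twenty_rows <- t(Twenty)

# A list output can be stacked row by row instead
Twenty_list <- replicate(20, sample.int(6, 3, replace = TRUE), simplify = FALSE)
from_list <- do.call(rbind, Twenty_list)
```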
You can use list:
Twenty=list()
for(i in 1:20){
Twenty[[i]] = sample.int(6, 3, replace = TRUE)
}
How can I generate all two way tables from a data frame in R?
some_data <- data.frame(replicate(100, base::sample(1:4, size = 50, replace = TRUE)))
combos <- combn(names(some_data), 2)
The following does not work, was planning to wrap a for loop around it and store results from each iteration somewhere
i=1
table(some_data[combos[, i][1]], some_data[combos[, i][2]])
Why does this not work? individual arguments evaluate as expected:
some_data[combos[, i][1]]
some_data[combos[, i][2]]
Calling it with the variable names directly yields the desired result, but how to loop through all combos in this structure?
table(some_data$X1, some_data$X2)
With combn, there is the FUN argument, so we can use that to extract the 'some_data' and then get the table output in an array
out <- combn(names(some_data), 2, FUN = function(i) table(some_data[i]))
Regarding the issue in the OP's post
table(some_data[combos[, i][1]], some_data[combos[, i][2]])
Both of them are data.frames, we can extract as a vector and it should work
# note the added comma after some_data[ in both arguments
table(some_data[, combos[, i][1]], some_data[, combos[, i][2]])
or more compactly
table(some_data[combos[, i]])
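The difference between the two kinds of subsetting can be seen on a tiny data frame. The data below are made up for illustration; the point is only that single-bracket indexing without a comma keeps a data.frame, while the comma form (or double brackets) returns a plain vector.

```r
small <- data.frame(X1 = c(1, 2, 2), X2 = c(3, 3, 4))

single_bracket <- small["X1"]    # one-column data.frame
with_comma     <- small[, "X1"]  # plain numeric vector
double_bracket <- small[["X1"]]  # also a plain numeric vector

is.data.frame(single_bracket)  # TRUE
is.numeric(with_comma)         # TRUE

# Two vectors give the two-way table
tab <- table(small[, "X1"], small[, "X2"])
```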
Update
combn has simplify = TRUE by default, that is, it converts the output to an array. If some columns take different sets of values, the tables will have different dimensions unless we convert the columns to factors with the levels specified. An array can only hold elements of fixed dimensions, so if the dimensions of some tables differ, the call results in an error. One way around this is to use simplify = FALSE to return a list, which doesn't have that restriction.
Here is an example where the previous code fails
set.seed(24)
some_data2 <- data.frame(replicate(5, base::sample(1:10, size = 50,
replace = TRUE)))
some_data <- data.frame(some_data, some_data2)
out1 <- combn(names(some_data), 2, FUN = function(i)
table(some_data[i]), simplify = FALSE)
is.list(out1)
#[1] TRUE
length(out1)
#[1] 5460
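With thousands of tables in an unnamed list, it can help to label each one by its column pair. The sketch below is one way to do that, assuming a small made-up data frame with four columns; col_pairs and tables are names chosen for the example.

```r
set.seed(24)
small_data <- data.frame(replicate(4, sample(1:4, size = 50, replace = TRUE)))

# Keep the pairs as a list so they can be reused for naming
col_pairs <- combn(names(small_data), 2, simplify = FALSE)
tables <- lapply(col_pairs, function(p) table(small_data[p]))
names(tables) <- vapply(col_pairs, paste, character(1), collapse = "_")

# Look up one table by the names of its two columns
tables[["X1_X2"]]
```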
I am trying not to use a for loop to assign values to the elements of a list.
Here, I create an empty list, give it a length of 20 and name each of the 20 elements.
mylist <- list()
length(mylist) <- 20
names(mylist) <- paste0("element", 1:20)
I want each element of mylist to contain samples drawn from a pool of randomly generated numbers denoted as x:
x <- runif(100, 0, 1)
I tried the following codes, which do not get to the desired result:
mylist[[]] <- sample(x = x, size = 20, replace = TRUE) # Gives an error
mylist[[1:length(mylist)]] <- sample(x = x, size = 20, replace = TRUE) # Does not give the desired result
mylist[1:length(mylist)] <- sample(x = x, size = 20, replace = TRUE) # Gives the same undesired result as the previous line of code
mylist[] <- sample(x = x, size = 20, replace = TRUE) # Gives the same undesired result as the previous line of code
P.S. As explained above, the desired result is a list of 20 elements, which individually contains 20 numeric values. I can do it using a for loop, but I would like to become a better R user and use vectorized operations as much as possible.
Thank you for your help.
Maybe replicate is what you're looking for.
mylist <- replicate(20, sample(x = x, size = 20, replace = TRUE), simplify=FALSE)
names(mylist) <- paste0("element", 1:20)
Note that there is no need to first create a list, replicate will do it for you.
Since you're using replace=TRUE, you could also generate all 400 at once and then split them up. If you were doing this many times, this would probably be faster than replicate. For only 20 times, the speed difference will hardly matter, and the code using replicate is perhaps easier to read and understand, so it might be preferred for that reason.
foo <- sample(x = x, size = 20*20, replace = TRUE)
mylist <- split(foo, rep(1:20, each=20))
Alternatively, you could split them by converting to a data frame first. Not sure which would be faster.
mylist <- as.list(as.data.frame(matrix(foo, ncol=20)))
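As a quick sanity check, the sketch below builds the list three ways and confirms the shape of the result. The seed and the names by_replicate, by_split and by_df are made up for the example; the two split-style versions cut the same vector foo into the same consecutive blocks of 20.

```r
set.seed(7)
x <- runif(100, 0, 1)

# Approach 1: one sample() call per element
by_replicate <- replicate(20, sample(x, size = 20, replace = TRUE),
                          simplify = FALSE)

# Approaches 2 and 3: draw all 400 values once, then cut into 20 pieces
foo <- sample(x, size = 20 * 20, replace = TRUE)
by_split <- split(foo, rep(1:20, each = 20))
by_df    <- as.list(as.data.frame(matrix(foo, ncol = 20)))

names(by_split) <- paste0("element", 1:20)
lengths(by_split)  # 20 elements, each of length 20
```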