In R, I have an object dataList which is a list, where each entry is a dataframe. Each dataframe has 2 columns, both of the same length (300, if it matters. dataList is 1000 entries long).
I need to take the average of all of the ith positions within this list. I.e. I need the average of all of the entries (i,2) of each dataframe. So, all 1000 of the (1,2) entries (one from each dataframe) should be averaged, and I would like this number to be stored in the 1st spot of a new list.
I am open to any solutions as to how to do this; if there is a better way to store the data that would probably be preferable.
Here's a minimal example which should help you:
# create dummy data
d1 <- data.frame(weight = c(23,78,98,50), height=c(50,170,190,150))
d2 <- data.frame(weight = c(13,58,78,90), height=c(20,140,172,200))
# create a list
data_list <- list(d1,d2)
# find the mean of the second column of each data frame, stored in a new list
l1 <- lapply(data_list, function(x) mean(x[[2]]))
print(l1)
[[1]]
[1] 140
[[2]]
[1] 133
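Note that the snippet above returns one mean per data frame. If what you need is one mean per row position, taken across all data frames (as the question describes), here is a minimal sketch. It assumes every data frame has the same number of rows; the names second_cols and position_means are only illustrative.

```r
# Dummy data: a list of data frames with the same number of rows
d1 <- data.frame(weight = c(23, 78, 98, 50), height = c(50, 170, 190, 150))
d2 <- data.frame(weight = c(13, 58, 78, 90), height = c(20, 140, 172, 200))
data_list <- list(d1, d2)

# Collect the second column of every data frame into a matrix:
# one row per position i, one column per data frame
second_cols <- sapply(data_list, function(x) x[[2]])

# Average across data frames for each position i
position_means <- rowMeans(second_cols)
print(position_means)
#> [1]  35 155 181 175
```

Storing the columns as a matrix like this is probably also the more convenient layout for the real data (300 rows by 1000 columns), since rowMeans() then does all the work in one call.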
I have some large data frames that are big enough to push the limits of R on my machine; e.g., the one on which I'm currently working is 2 columns by 70 million rows. The contents aren't important, but just in case, column 1 is a string and column 2 is an integer.
What I would like to do is split that data frame into n parts (say, 20, but preferably something that could change on a case-by-case basis) so that I can work on each of the smaller data frames one at a time. That means that (a) the result has to produce things that are named (e.g., "newdf_1", "newdf_2", ... "newdf_20" or something), and (b) each line in the original data frame needs to be in one (and only one) of the new "sub" data frames. The order does not matter, but doing it sequentially by rows makes sense to me.
Once I do the work, I will start to recombine them (using rbind()) one pair at a time.
I've looked at split(), but from what I can tell, it is designed to work with factors (which I don't have).
Any ideas?
You can create a new column and split the data frame based on that column. The column does not need to be a factor, but it needs to be a data type that the split function can convert to a factor.
# Number of groups
N <- 20
dat$group <- 1:nrow(dat) %% N
# Add 1 to group
dat$group <- dat$group + 1
# Split the dat by group
dat_list <- split(dat, f = ~group)
# Set the name of the list
names(dat_list) <- paste0("newdf_", 1:N)
Data
set.seed(123)
# Create example data frame
dat <- data.frame(
A = sample(letters, size = 70000000, replace = TRUE),
B = rpois(70000000, lambda = 1)
)
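The modulo approach above deals rows out round-robin (row 1 to group 2, row 2 to group 3, and so on). If you would rather keep consecutive rows together, as the question suggests, a possible sketch is below. It uses a small stand-in data frame (dat_small, a hypothetical name) so that it runs quickly; chunk_id and seq_list are also just illustrative names.

```r
# Small stand-in for the large data frame (hypothetical example data)
set.seed(123)
dat_small <- data.frame(
  A = sample(letters, size = 103, replace = TRUE),
  B = rpois(103, lambda = 1)
)

N <- 20  # number of chunks

# Assign consecutive rows to chunks 1..N of (almost) equal size
chunk_id <- ceiling(seq_len(nrow(dat_small)) * N / nrow(dat_small))

# Split by chunk and name the pieces newdf_1, newdf_2, ...
seq_list <- split(dat_small, chunk_id)
names(seq_list) <- paste0("newdf_", seq_along(seq_list))

# Every row ends up in exactly one chunk
sum(sapply(seq_list, nrow)) == nrow(dat_small)
```

Because the pieces keep the original row order, do.call(rbind, seq_list) should give back the rows in the same order as the original data frame.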
Here's a tidyverse based solution. Try using read_csv_chunked().
# practice data
library(tidyverse)
tibble(string = sample(letters, 1e6, replace = TRUE),
       value = rnorm(1e6)) %>%
  write_csv("test.csv")
# here's the solution
partial_data <- read_csv_chunked("test.csv",
DataFrameCallback$new(function(x, pos) filter(x, string == "a")),
chunk_size = 1000)
You can wrap the call to read_csv_chunked in a function where you change the string that you subset on.
This is more or less a repeat of this question:
How to read only lines that fulfil a condition from a csv into R?
I have 2 simple but long dataframes and I would like to compare the column of 1 data frame to a specific column in another data frame to see how many and which values are the same.
For example, the word "HAT" might be present in row 1 in the 1st data frame, and it might be present in row 76 of the 2nd data frame. I want the output to tell me that the word "HAT" is present in both dataframes (along with all the other similarities), rather than just tell me how many values match up.
Please let me know if there is a function I can use! Comparedf is not working well. It would also be best if I could get the results in the form of another data frame.
Using indexing and the %in% operator can help:
use %in% to find which values in the first dataframe are also in the second dataframe
use the logical vector to index the column, and return the corresponding values
use unique() if you don't want repeats
use data.frame() to construct a dataframe with one column of common values
# create two example dataframes:
df1 <- data.frame(chars = rep(LETTERS[1:10], 2))
df2 <- data.frame(chars = LETTERS[8:20])
# find the common values:
df1$chars[df1$chars %in% df2$chars]
#> [1] "H" "I" "J" "H" "I" "J"
# alternatively, only show the unique values:
common <- unique(df1$chars[df1$chars %in% df2$chars])
# and create a dataframe from it:
common_df <- data.frame(common)
# see contents:
common_df
#> common
#> 1 H
#> 2 I
#> 3 J
Created on 2021-04-14 by the reprex package (v2.0.0)
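If you also want to know where each shared value sits in the two data frames (for example "HAT" in row 1 of one and row 76 of the other), a minimal sketch with intersect() and match() is below. Note that match() only reports the first row in which a value appears; common_vals and common_pos are illustrative names.

```r
df1 <- data.frame(chars = rep(LETTERS[1:10], 2))
df2 <- data.frame(chars = LETTERS[8:20])

# Unique values present in both columns
common_vals <- intersect(df1$chars, df2$chars)

# For each common value, the first row where it appears in each data frame
common_pos <- data.frame(
  value   = common_vals,
  row_df1 = match(common_vals, df1$chars),
  row_df2 = match(common_vals, df2$chars)
)
common_pos
```

Here "H" is first found in row 8 of df1 and row 1 of df2. If you need every occurrence rather than the first, which(df1$chars %in% common_vals) gives all matching row numbers.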
I have a list of 181 data frames, and I want to extract the 2nd column of each one, save it in a csv file and label it; the labels for those 181 dfs are 0,1,2,3,4,5,6.
The problem is that each df has a different length, and I don't know if that can be handled in R!
This is an inefficient but easily coded solution (and efficiency doesn't matter when all you need to do is output a short CSV file). It writes each data frame one line at a time, assuming the data frames are represented by a list l.df.
#
# Prepare for output and input.
#
fn <- "temp.csv"
if(is.null(names(l.df))) names(l.df) <- 1:length(l.df)
#
# Loop over the data frames.
#
append <- FALSE
for (s in names(l.df)) {
#
# Create a one-row data frame for this column.
#
X <- data.frame(ID=s, as.list(l.df[[s]][[2]]))
#
# Append it to the output.
#
write.table(X, file=fn, sep=",", row.names=FALSE, col.names=FALSE, append=append)
append <- TRUE
}
For example, we may prepare a set of data frames with random entries:
set.seed(17)
l.df <- lapply(1+rpois(181, 5), function(n) data.frame(X=1:n, Y=round(rnorm(n),2)))
The output file looks like this:
"1",0.37,1.61,0.02,0.51
"2",1.07,0.13,-0.55,0.34,2.24,0.41,0.26,0.13,-0.48,0.07,0.54
... (177 lines omitted)
"180",0.58,-1.5,1.85,-1.02
"181",-0.59,0.12,-0.38,-0.35,1.22,-0.63,0.81
There are many ways of solving your issues, I'll just propose the simplest one with base R, looping 🙌 (otherwise work with tidyverse).
The issue of differing df lengths (in terms of rows) can be solved by adding NAs at the end.
I assume this is your setup:
# Your list of data frames
yourlistofdataframes <- list()
for (i in 1:182) { # R list indices start at 1 (Python's start at 0); 182 data frames match the labels 0:181
nrowofdf <- sample(1:100,1) # random number of rows between 1 and 100
yourlistofdataframes[[i]] <- data.frame(cbind(rep(paste0("df",i,"|column1"),nrowofdf),
rep(paste0("df",i,"|column2"),nrowofdf),
rep(paste0("df",i,"|column3"),nrowofdf)))
}
names(yourlistofdataframes) <- 0:181 # labeling the data frames
Then this is your solution:
newlist <- list()
for (i in 1:length(yourlistofdataframes)){
newlist[[i]] <- unlist(yourlistofdataframes[[i]][2])
}
names(newlist) <- 0:181 # give them the names you wanted
newlist <- lapply(newlist, `length<-`, max(lengths(newlist))) # add NA's to make them equal length
# bind back to data.frame & save as csv
newdf <- data.frame(newlist) # if you want one column per data frame in your final df
newdft <- t(newdf) # if you want one row per data frame in your final df
write.csv(newdf, "mycsv.csv")
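The padding step with `length<-` is the least obvious part, so here is a tiny sketch of what it does on its own. The toy list cols is purely hypothetical.

```r
# Hypothetical toy list with vectors of different lengths
cols <- list(a = c("x", "y", "z"), b = "x", c = c("x", "y"))

# Extend every vector to the longest length; the added slots become NA
padded <- lapply(cols, `length<-`, max(lengths(cols)))

# Now the vectors can be combined into a rectangular data frame
padded_df <- data.frame(padded)
print(padded_df)
```

Column a stays as it is, while b gets two NAs and c gets one, so all three have length 3.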
Feedback on your question:
If you want to ask for coding advice, post some representation of your data, so that people don't have to guess what your data looks like or build their own.
Ok, first let me define the project I'm working on and what I need as results. I have this data of 332 comma separated value files which is actually the US department of health's statistical data on pollution parameters for nitrate and sulphate pollutants. There are 332 IDs for a list of 332 entities that monitored pollution across all cities and I have to conduct some analysis on it.
So the data table looks something like this:
Date (yyyy-mm-dd) sulphate nitrate ID
xxxx-xx-01 some value some value 5
xxxx-xx-02 some value some value 5
So there's daily data for monitor number 5. The same goes for the rest of the 332 IDs.
Now my job requires me to write a function that displays the number of non-NA values according to each ID. Ok? Which means that if I specify ID 1, then the function will process the data table and provide me with a simple data frame with the ID number and the number of non-NA values for that ID.
Now, my current function performs exactly as it should if I specify only one ID. But when I specify a combination of IDs, the function gives me the sum over all the IDs specified, which is not what I want.
This is the full function:
complete <- function(directory, id = 1:332){
file.list <- list.files(directory, full.names = TRUE) ## list of files need to be read
pollutant.data <- data.frame() ## forming an empty data frame
for (i in 1:332){
pollutant.data <- rbind(pollutant.data, read.csv(file.list[i]))} ## final data frame
ID <- pollutant.data[which(pollutant.data[,'ID'] %in% id),] ## for the id vector that needs to be specified in the function
good <- complete.cases(pollutant.data)
nobs <- sum(complete.cases(ID)) ## sum of complete data according to ID
return(data.frame(id, nobs = nobs))
}
Now when I specify, say, id = 1, the function gives me the correct result. But when I specify more than one ID, like c(3,6,4), it gives me the sum over all three IDs, repeated in the row for each individual ID, and that is what I'm unable to resolve.
Now, I believe that I need to work on the for loop to specify that sum needs to be found for each ID, I tried inserting the for loop between the good and nobs vector but still I get the wrong values.
This line is the culprit:
ID <- pollutant.data[which(pollutant.data[,'ID'] %in% id),]
You are asking for all of the IDs that match the vector you pass, so you are in no way separating by ID.
A leaner way to do it is:
# Generate a mock dataset
test.data <- data.frame(ID = rep(1:5, each = 10), value = rnorm(50))
# Every 5th element is NA
test.data$value[seq(1, 50, 5)] <- NA
countComplete <- function(data.frame, id)
{
# Get only the subset of data we need (only requested id)
tmp <- subset(data.frame, ID %in% id)
# Split it by ID
tmp.split <- split(tmp, tmp$ID)
# Use sapply to count complete cases for the different IDs
res <- sapply(tmp.split, function(x){
sum(complete.cases(x))
})
# Return the result
res
}
For instance:
countComplete(test.data, c(1,3))
Returns
1 3
8 8
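If you need the result as a data frame with one row per requested ID, in the order the IDs were passed (which is what the original function was trying to return), a possible sketch is below. The function name completeById is made up for illustration, and it works on a data frame that has already been read in rather than on a directory of files.

```r
# Mock dataset: 5 IDs with 10 rows each, every 5th value is NA
set.seed(1)
test.data <- data.frame(ID = rep(1:5, each = 10), value = rnorm(50))
test.data$value[seq(1, 50, 5)] <- NA

completeById <- function(data, id) {
  # TRUE for rows without any NA
  ok <- complete.cases(data)
  # Count the complete rows separately for each requested ID
  nobs <- vapply(id, function(k) sum(ok & data$ID == k), integer(1))
  data.frame(id = id, nobs = nobs)
}

completeById(test.data, c(3, 1))
```

This returns two rows, for IDs 3 and 1 in that order, each with 8 complete observations.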
I have a data frame (d) composed of 640 observations for 55 variables.
I would like to randomly split this data frame into 10 sub data frames of 64
observations for 55 variables. I don't want any observation to be in
more than one sub data frame.
This code works for one sample:
d1 <- d[sample(nrow(d),64,replace=F),]
How can I repeat this treatment ten times?
This one gives me a matrix with 10 columns (each one is one sample of row indices):
d1 <- replicate(10,sample(nrow(d),64,replace = F))
Can anyone help me?
Here's a solution that returns the result in a list of data.frames:
d <- data.frame(A=1:640, B=sample(LETTERS, 640, replace=TRUE)) # an exemplary data.frame
idx <- sample(rep(1:10, length.out=nrow(d)))
res <- split(d, idx)
res[[1]] # first data frame
res[[10]] # last data frame
The only tricky part involves creating idx. idx[i] identifies the resulting data.frame, idx[i] in {1,...,10}, in which the ith row of d will occur. Such an approach assures us that no row will be put into more than 1 data.frame.
Also, note that sample here returns a random permutation of (1,2,...,10,1,2,...,10,...), i.e. the values 1 to 10 each repeated 64 times.
Another approach is to use:
apply(matrix(sample(nrow(d)), ncol=10), 2, function(idx) d[idx,])
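If you want to convince yourself that the split() approach really puts every row into exactly one sub data frame, a quick check along these lines may help. It reuses the exemplary data frame from above; all_rows is just an illustrative name.

```r
d <- data.frame(A = 1:640, B = sample(LETTERS, 640, replace = TRUE))
idx <- sample(rep(1:10, length.out = nrow(d)))
res <- split(d, idx)

# Each sub data frame has 64 rows
sapply(res, nrow)

# Each original row (identified by column A) appears exactly once overall
all_rows <- unlist(lapply(res, function(x) x$A))
length(all_rows) == nrow(d) && !anyDuplicated(all_rows)
```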