With a list of sequences, for example,
datList <- list(One = seq(1,5, length.out = 20),
Two = seq(1,10, length.out = 20),
Three = seq(5,50, length.out = 20))
Is it possible to make a data frame so that the sequences are converted into columns? As in,
datDF <- data.frame(One = datList[[1]], Two = datList[[2]], Three = datList[[3]] )
> head(datDF)
One Two Three
1 1.000000 1.000000 5.000000
2 1.210526 1.473684 7.368421
3 1.421053 1.947368 9.736842
4 1.631579 2.421053 12.105263
5 1.842105 2.894737 14.473684
6 2.052632 3.368421 16.842105
Within the context of my 'real' data I am working with many sequences and was hoping an apply (or similar) function could be used rather than manually creating the desired data frame.
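We can pass the list directly to data.frame(), which turns each named element into a column (the elements must all have the same length):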
datDFN <- data.frame(datList)
identical(datDF, datDFN)
#[1] TRUE
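If the list is built up programmatically, a couple of equivalent base R spellings may be handy. This is only a small sketch that reuses datList from the question; viaAs and viaDoCall are names made up for illustration, and both calls assume every element has the same length.

```r
datList <- list(One = seq(1, 5, length.out = 20),
                Two = seq(1, 10, length.out = 20),
                Three = seq(5, 50, length.out = 20))

# as.data.frame() uses its list method: one column per list element
viaAs <- as.data.frame(datList)

# do.call() passes the list elements as named arguments to data.frame()
viaDoCall <- do.call(data.frame, datList)

# compare both against the direct data.frame(datList) call
all.equal(viaAs, data.frame(datList))
all.equal(viaDoCall, data.frame(datList))
```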
Related
I have a data.frame in R, containing several categorical variables, each with its own mean and standard deviation. I want to generate values from a normal data distribution for each categorical variable defined by these values and generate individual data.frames for each discrete categorical variable.
Here's some dummy data
dummy_data <- data.frame(VARIABLE = LETTERS[seq( from = 1, to = 10 )],
MEAN = runif(10, 5, 10), SD = runif(10, 1, 3))
dummy_data
VARIABLE MEAN SD
1 A 6.278751 1.937093
2 B 6.384247 2.487678
3 C 9.017496 2.003202
4 D 5.125994 1.829517
5 E 9.525213 1.914513
6 F 9.004893 2.734934
7 G 9.780757 2.511341
8 H 5.372160 1.510281
9 I 6.240331 2.796826
10 J 8.478280 2.325139
What I'd like to do from here, is to generate individual data.frames for each row, with each data.frame containing a normal distribution based on the MEAN and SD columns.
So, for example, I'd have a separate data.frame that contained....
A <- subset(dummy_data, VARIABLE == 'A')
A <- data.frame(rnorm(20, A$MEAN, A$SD))
A
rnorm.20..A.MEAN..A.SD.
1 5.131331
2 9.388104
3 8.909453
4 5.813257
5 5.353137
6 7.598521
7 2.693924
8 5.425703
9 8.939687
10 9.148066
11 4.528936
12 7.576479
13 8.207456
14 6.838258
15 6.972061
16 7.824283
17 6.283434
18 4.503815
19 2.133388
20 7.472886
The real data I'm working with is much larger than ten rows, and so I don't want to subset the whole thing to generate the individual data.frames if I can help it.
Thanks in advance
What about a solution using dplyr?
library(dplyr)
# A data frame containing all the simulated values, 20 per VARIABLE
# (reframe() needs dplyr >= 1.1.0; dplyr 1.0.x accepts summarise() here instead)
Huge_df <- dummy_data %>% group_by(VARIABLE) %>% reframe(SIMULATED = rnorm(20, MEAN, SD))
# You can then split the data frame if needed:
Splitted <- split(Huge_df, Huge_df$VARIABLE)
If you then need to save every individual data frame, or do something else with them, you can loop over the Splitted list (e.g. with lapply()) or export its elements with list2env().
Using data.table:
library(data.table)
result <- setDT(dummy_data)[, .(sample=rnorm(20, mean=MEAN, sd=SD)), by=.(VARIABLE)]
list.of.df <- split(result, result$VARIABLE)
You can put everything into a list, then return all the elements in the list to the global environment (if desired, or keep in the list):
set.seed(123)
dummy_data <- data.frame(VARIABLE = LETTERS[seq( from = 1, to = 10 )],
MEAN = runif(10, 5, 10), SD = runif(10, 1, 3))
# put all the values into a list
list_dist <- vector(mode = "list", length = nrow(dummy_data))
for(i in 1:nrow(dummy_data)){
list_dist[[i]] <- data.frame(values = rnorm(20, dummy_data[i,2], dummy_data[i,3]))
}
# name the list elements
names(list_dist) <- dummy_data$VARIABLE
# or more detailed names, for instance,
# names(list_dist) <- paste0(dummy_data$VARIABLE, "_Distribution")
#return all list values to the global environment
list2env(list_dist,globalenv())
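The same list can probably be built without an explicit loop. The sketch below swaps the for loop for base R's Map(), which walks over MEAN and SD in parallel; it assumes the same dummy_data layout as above and keeps everything in the list rather than exporting it.

```r
set.seed(123)
dummy_data <- data.frame(VARIABLE = LETTERS[1:10],
                         MEAN = runif(10, 5, 10), SD = runif(10, 1, 3))

# one data frame of 20 draws per row of dummy_data
list_dist <- Map(function(m, s) data.frame(values = rnorm(20, mean = m, sd = s)),
                 dummy_data$MEAN, dummy_data$SD)

# name the list elements after the categorical variable
names(list_dist) <- as.character(dummy_data$VARIABLE)

str(list_dist[["A"]])
```

Keeping the data frames in a named list means later steps can use lapply(list_dist, ...) instead of referring to ten separate objects.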
I am VERY new to loops. Some of my loops have been successful, others... not so much.
I have some observed data (df_obs) that I'd like to test against my model predictions (df_preds).
MY CURRENT AIM: write a loop which makes a list of data frames, so that I can use this list in future loops assessing model performance. I will probably be back for help with THOSE loops...
YES: I do want a list of data frames. I'm working with 50+ species and have a bunch of tests to run on these values.
MAYBE: I think I want a for() loop, but if a different method is easier e.g. lapply(), I'm open to suggestions.
I've done my best to create a reproducible data set and code that mimics what I am working with:
#observed presence (1) and absence (0)
set.seed(733)
df_obs <- data.frame(plot = 1:10,
sp1 = sample(0:1, 10, replace = TRUE),
sp2 = sample(0:1, 10, replace = TRUE),
sp3 = sample(0:1, 10, replace = TRUE))
#predicted probability of occurrence (ranges from 0 to 1)
set.seed(733)
df_preds <- data.frame(plot = 1:10,
sp1 = runif(10, 0, 1),
sp2 = runif(10, 0, 1),
sp3 = runif(10, 0, 1))
sppcodes <- c("sp1", "sp2", "sp3")
test.eval.list <- vector("list", length = length(sppcodes))
names(test.eval.list) <- sppcodes
for(i in seq_along(sppcodes)){
sppn <- sppcodes[i]
plot = df_obs$plot
obs = df_obs[,sppn]
pred = df_preds[,sppn]
df <- data.frame(plot, obs, pred) #produces dataframe as expected
test.eval.list[sppn] <- df #problem seems to be here, it ends up assigning a vector of numbers...
}
Could someone please help me understand why I am not ending up with a list of data frames, and give a correct way of doing so?
Please note - I know there are areas which could be done in a single line of code, I prefer this way of spreading the code out to understand which parts are/are not working.
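You had a small mistake in the for loop: you need [[ instead of [ when assigning into the list. With [, R treats the data frame as a list of columns and stores only the first one (with a warning), whereas [[ stores the whole data frame as a single element. You may want to read ?Extract if you are interested in the different ways of accessing elements.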
for(i in seq_along(sppcodes)){
sppn <- sppcodes[i]
plot = df_obs$plot
obs = df_obs[,sppn]
pred = df_preds[,sppn]
df <- data.frame(plot, obs, pred)
test.eval.list[[sppn]] <- df
}
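To see the difference in isolation, here is a tiny sketch with a made-up two-element list (lst and df are illustrative names, not objects from the question):

```r
df <- data.frame(x = 1:3, y = 4:6)

lst <- vector("list", 2)
names(lst) <- c("a", "b")

# [[<- stores the whole data frame as one list element
lst[["a"]] <- df

# [<- expects a list of replacements; a data frame is a list of columns,
# so only its first column is kept (R also emits a warning here)
suppressWarnings(lst["b"] <- df)

class(lst[["a"]])   # "data.frame"
class(lst[["b"]])   # "integer"

# wrapping the data frame in list() makes single-bracket assignment work too
lst["b"] <- list(df)
class(lst[["b"]])   # "data.frame"
```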
However, an alternative is to use Map:
Map(cbind.data.frame, plot = list(df_obs$plot),obs=df_obs[-1],pred = df_preds[-1])
#[[1]]
# plot obs pred
#1 1 1 0.3266487
#2 2 1 0.3745092
#3 3 0 0.8633161
#4 4 0 0.1970302
#5 5 1 0.3017755
#6 6 0 0.9154151
#7 7 0 0.6193044
#8 8 0 0.4020479
#9 9 1 0.9947362
#10 10 1 0.7975380
#...
#....
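Since the question mentioned lapply(), here is a sketch of that route as well. It loops over the species codes and names the result, so each data frame can be pulled out by species later; it assumes df_obs and df_preds share the same plot order, as in the example data.

```r
set.seed(733)
df_obs <- data.frame(plot = 1:10,
                     sp1 = sample(0:1, 10, replace = TRUE),
                     sp2 = sample(0:1, 10, replace = TRUE),
                     sp3 = sample(0:1, 10, replace = TRUE))
df_preds <- data.frame(plot = 1:10,
                       sp1 = runif(10, 0, 1),
                       sp2 = runif(10, 0, 1),
                       sp3 = runif(10, 0, 1))
sppcodes <- c("sp1", "sp2", "sp3")

# build one data frame per species code
test.eval.list <- lapply(sppcodes, function(sp) {
  data.frame(plot = df_obs$plot,
             obs  = df_obs[[sp]],
             pred = df_preds[[sp]])
})

# name the elements so they can be accessed as test.eval.list[["sp2"]]
names(test.eval.list) <- sppcodes

str(test.eval.list[["sp2"]])
```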
I want to multiply all the values in columns e.g. by 5, and then save the results into a new dataset, without changing the data being read in.
Using a loop I use the following R code:
raw_data[,i]<-raw_data[,i]*5
What I want is to keep the original data as it is, raw_data, and save the multiplied data into e.g. new_data:
new_data[,i]<-raw_data[,i]*5
I get an error saying the object 'new_data' is not found.
Is there a neat way of doing this, or do you have to create the new_data object first as an empty dataset?
No need for loops here.
# a toy data frame
raw_data <- data.frame(x = 1:2, y = 3:4, z = 5:6)
# same applies if you have your data in a matrix
# raw_data <- matrix(1:6, ncol = 3)
raw_data
# x y z
# 1 1 3 5
# 2 2 4 6
new_data <- raw_data * 5
new_data
# x y z
# 1 5 15 25
# 2 10 20 30
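If only some of the columns should be multiplied (for instance when there is a non-numeric column), one option is to copy the data first and then overwrite just those columns in the copy. A minimal sketch, assuming a hypothetical id column that is not in the original example:

```r
raw_data <- data.frame(id = c("a", "b"), x = 1:2, y = 3:4)

# start from a copy, so raw_data itself is never modified
new_data <- raw_data

# pick the columns to scale (here: every numeric column)
num_cols <- vapply(new_data, is.numeric, logical(1))
new_data[num_cols] <- new_data[num_cols] * 5

new_data
raw_data   # unchanged
```

This also answers the "object not found" part: new_data has to exist before new_data[, i] <- ... can assign into it, and copying raw_data is a simple way to create it.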
I am wondering if it is possible to create a new data frame with certain cells from each file in the working directory. For example, say I have 2 data frames like this (please ignore the numbers as they are random):
Say in each dataset, row 4 is the sum of my values and row 5 is the number of missing values. If I represent the number of missing values as "M" and the sum of columns as "N", what I am trying to achieve is the following table:
So each file's 'N' and 'M' are in one single row.
I have many files in the directory, so I have read them into a list, but I am not sure what would be the best way to perform such a task on a list of files.
This is my sample code for the tables I have shown and how I read them into a list:
##Create sample data
df = data.frame(Type = 'wind', v1=c(1,2,3,100,50), v2=c(4,5,6,200,60), v3=c(6,7,8,300,70))
df2 =data.frame(Type = 'test', v1=c(3,2,1,400,40), v2=c(2,3,4,500,30), v3=c(6,7,8,600,20))
# write to directory
write.csv(df, file = "sample1.csv", row.names = F)
write.csv(df2, file = "sample2.csv", row.names = F)
# read to list
mycsv = dir(pattern=".csv")
n <- length(mycsv)
mylist <- vector("list", n)
for(i in 1:n) mylist[[i]] <- read.csv(mycsv[i],header = TRUE)
I would be really grateful if you could give me some suggestions about whether this is possible and how I should approach it.
Many thanks,
Ayan
This should work:
processFile <- function(File) {
d <- read.csv(File, skip = 4, nrows = 2, header = FALSE,
stringsAsFactors = FALSE)
dd <- data.frame(d[1,1], t(unlist(d[-1])))
names(dd) <- c("ID", "v1N", "V1M", "v2N", "V2M", "v3N", "V3M")
return(dd)
}
ll <- lapply(mycsv, processFile)
do.call(rbind, ll)
# ID v1N V1M v2N V2M v3N V3M
# 1 wind 100 50 200 60 300 70
# 2 test 400 40 500 30 600 20
(The one slightly tricky/unusual bit comes in that third line of processFile(). Here's a code snippet that should help you see how it accomplishes what it does.)
(d <- data.frame(a="wind", b=1:2, c=3:4))
# a b c
# 1 wind 1 3
# 2 wind 2 4
t(unlist(d[-1]))
# b1 b2 c1 c2
# [1,] 1 2 3 4
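If the files have already been read into a list, the same table can probably be built without re-reading them. The sketch below applies the same t(unlist()) idea to rows 4 and 5 of each data frame; summarise_one is a hypothetical helper name, and the list is rebuilt inline here so the example runs without touching the file system.

```r
df  <- data.frame(Type = "wind", v1 = c(1, 2, 3, 100, 50),
                  v2 = c(4, 5, 6, 200, 60), v3 = c(6, 7, 8, 300, 70))
df2 <- data.frame(Type = "test", v1 = c(3, 2, 1, 400, 40),
                  v2 = c(2, 3, 4, 500, 30), v3 = c(6, 7, 8, 600, 20))
mylist <- list(df, df2)

summarise_one <- function(d) {
  vals <- d[4:5, -1]                        # row 4 = N (sum), row 5 = M (missing)
  out <- data.frame(ID = d$Type[1], t(unlist(vals)))
  # build names v1N, v1M, v2N, v2M, ... from the value column names
  names(out) <- c("ID", paste0(rep(names(vals), each = 2), c("N", "M")))
  out
}

res <- do.call(rbind, lapply(mylist, summarise_one))
res
```

Deriving the column names from names(vals) avoids hard-coding them, which may help if the real files have more than three value columns.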
CAVEAT: I'm not sure I fully understand what you want. I think you're reading in a list and want to select certain dataframes from that list with the same rows from that list. Then you want to create a data frame of those rows and go from long to wide format.
LIST <- lapply(seq_along(mylist), function(i) {
x <- mylist[[i]][4:5, ]
x <- data.frame(x, row = factor(rownames(x)))
return(x)
}
)
DF <- do.call("rbind", LIST) #lets you bind an unknown number of rows from a list
levels(DF$row) <- list(N = "4", M = "5") #recodes row 4 as N (sum) and row 5 as M (missing)
wide <- reshape(DF, v.names=c("v1", "v2", "v3"), idvar=c("Type"),
timevar="row", direction="wide") #reshape from long to wide
rownames(wide) <- 1:nrow(wide) #give proper row names
wide
This yields:
Type v1.N v2.N v3.N v1.M v2.M v3.M
1 wind 100 200 300 50 60 70
2 test 400 500 600 40 30 20
I have a column of integers that I would like to split into multiple, separate integer vectors.
Creating a list of data frames using split() doesn't work for my later purposes
df <- as.data.frame(runif(n = 10000, min = 1, max = 10))
where split() creates a list of data frames, which I can't use for further purposes; I need each piece as a separate integer vector under "Values"
map.split <- split(df, (as.numeric(rownames(df)) - 1) %/% 250) # this does not do the trick
My goal is to split the column into different integer vectors (not saved under the Global Environment "Data", but "Values")
This would be the slow way:
VecList1 <- df[1:250,]
VecList2 <- df[251:500,]
with
str(VecList1)
Int [1:250] 1 1 10 5 3 ....
Any advice welcome
If I'm interpreting correctly (not clear to me), here's a reduced problem and what I think you're asking for.
set.seed(2)
df <- data.frame(x = runif(10, min = 1, max = 10))
df$Values <- (seq_len(nrow(df))-1) %/% 4
df
# x Values
# 1 2.663940 0
# 2 7.321366 0
# 3 6.159937 0
# 4 2.512467 0
# 5 9.494554 1
# 6 9.491275 1
# 7 2.162431 1
# 8 8.501039 1
# 9 5.212167 2
# 10 5.949854 2
If all you need is that Values column as its own object, then you can just change df$Values <- ... to Values <- ....
Here's one way of doing this (although it's probably better to figure out a way where you don't need a series of separate vectors, but rather work with columns in a single matrix):
df <- data.frame(a=runif(n = 10000, min = 1, max = 10))
mx<-matrix(df$a,nrow=250)
for (i in 1:NCOL(mx)) {
assign(paste0("VecList",i),mx[,i])}
Note: using assign is generally not advisable. Whatever it is you're trying to achieve, there's probably a better way of doing it without creating a series of new vectors in the global environment.
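One such alternative, sketched here under the assumption that plain vectors are all that is needed: split the column itself rather than the data frame. split() on a vector returns a list of vectors, not a list of data frames. The names chunk_id and VecList are just illustrative.

```r
set.seed(1)
df <- data.frame(a = runif(n = 10000, min = 1, max = 10))

# chunk index: 0 for elements 1-250, 1 for elements 251-500, and so on
chunk_id <- (seq_len(nrow(df)) - 1) %/% 250

# splitting the column (not the data frame) returns plain numeric vectors
VecList <- split(df$a, chunk_id)

length(VecList)      # 40
str(VecList[[1]])
```

Each chunk is then available as VecList[[1]], VecList[[2]], and so on, without creating 40 separate objects in the global environment.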