I am trying to apply a conditional operation across a large list of 64 data frames. My data takes the following form:
file_list Large list (64 elements, 4.2 Mb)
file1: 'data.frame': 3012 obs. of 4 variables:
..$V1: int[1:3012] 1850 1850 1850 ...
..$V2: int[1:3012] 1 2 3 ...
..$V3: int[1:3012] 16 15 16 ...
..$V4: num[1:3012] 4.69E-05 6.99E-05 5.62E-05 ...
................................................................................
file64: 'data.frame': 5412 obs. of 4 variables:
..$V1: int[1:5412] 1850 1850 1850 ...
..$V2: int[1:5412] 1 2 3 ...
..$V3: int[1:5412] 16 15 16 ...
..$V4: num[1:5412] 6.96E-05 4.99E-05 5.37E-05 ...
What I want to do is multiply the fourth column ($V4) in each of the 64 files by a different number depending on the contents of the second column ($V2). The numbers in $V2 are months of the year, and I need to multiply $V4 by 31 when $V2 is 1, 3, 5, 7, 8, 10 and 12; 30 when $V2 is 4, 6, 9 and 11; and 28.25 when $V2 is 2.
I assume this will involve some sort of for loop, but I haven't been able to complete this task. Any suggestions?
Here's a reproducible solution that uses a small function:
file_list <- list(file1 = data.frame(v1 = sample(1:100, 100, TRUE),
                                     v2 = sample(c(1,2,3,5,6,8,10,4,6,9,11), 100, TRUE),
                                     v4 = rnorm(100)),
                  file2 = data.frame(v1 = sample(1:100, 100, TRUE),
                                     v2 = sample(c(1,2,3,5,6,8,10,4,6,9,11), 100, TRUE),
                                     v4 = rnorm(100)))
str(file_list)
# List of 2
# $ file1:'data.frame': 100 obs. of 3 variables:
# ..$ v1: int [1:100] 6 90 66 86 32 33 50 46 19 59 ...
# ..$ v2: num [1:100] 5 10 2 10 8 6 10 3 5 5 ...
# ..$ v4: num [1:100] -0.639 -2.234 -0.816 0.997 -0.302 ...
# $ file2:'data.frame': 100 obs. of 3 variables:
# ..$ v1: int [1:100] 34 25 24 4 100 59 80 100 21 97 ...
# ..$ v2: num [1:100] 3 6 8 8 9 1 8 1 3 3 ...
# ..$ v4: num [1:100] -2.2599 0.0548 -1.1666 -0.4049 0.4681 ...
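myFun <- function(df) {
  # logical masks for each group of months (v2 holds the month number)
  days31 <- df$v2 %in% c(1,3,5,7,8,10,12)
  days30 <- df$v2 %in% c(4,6,9,11)
  feb    <- df$v2 == 2
  # scale v4 by the multiplier that belongs to each month
  df$v4[days31] <- df$v4[days31] * 31
  df$v4[days30] <- df$v4[days30] * 30
  df$v4[feb]    <- df$v4[feb] * 28.25
  df
}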
lapply(file_list, myFun)
# lapply(file_list, FUN = function(x) head(myFun(x)))
# $file1
# v1 v2 v4
# 1 6 5 -19.816836
# 2 90 10 -69.264329
# 3 66 2 -23.054110
# 4 86 10 30.910798
# 5 32 8 -9.347289
# 6 33 6 -16.316746
#
# $file2
# v1 v2 v4
# 1 34 3 -70.055942
# 2 25 6 1.642744
# 3 24 8 -36.165864
# 4 4 8 -12.550877
# 5 100 9 14.041857
# 6 59 1 -2.556662
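The three assignments in myFun can also be collapsed into a single lookup. The following is only a sketch of that idea, not the answer's method: the names days and scale_by_month and the tiny example list are made up for illustration, and the multipliers are the ones stated in the question (28.25 for February).

```r
# Multiplier for each month, indexed by month number (1 = January).
# Values are taken from the question; 28.25 for February is the asker's choice.
days <- c(31, 28.25, 31, 30, 31, 30, 31, 31, 30, 31, 30, 31)

# Hypothetical helper: scale v4 by the multiplier of the month stored in v2
scale_by_month <- function(df) {
  df$v4 <- df$v4 * days[df$v2]
  df
}

# Small made-up list standing in for the 64 data frames
file_list <- list(
  file1 = data.frame(v2 = c(1, 2, 4), v4 = c(1, 1, 1)),
  file2 = data.frame(v2 = c(12, 9),   v4 = c(2, 2))
)

scaled <- lapply(file_list, scale_by_month)
```

This assumes v2 only ever contains whole numbers from 1 to 12; any other value would produce NA in v4.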
This should be pretty easy, but I don't know how. I have a single dataframe and a list with two dataframes. Now I want to combine them, so that I have a single list with three dataframes. And I do not want to do it "manually".
a = data.frame(xa = 1:10,
ya = 11:20)
b = list(c = data.frame(x = 1:10),
d = data.frame(x = 1:20,
y = 11:30))
Now I thought about something like this:
res = c(a, b)
But this results in this:
> sapply(res, class)
xa ya c d
"integer" "integer" "data.frame" "data.frame"
So it splits the single dataframe into its two columns, each of which becomes a separate element of the list.
How could I maintain the dataframe structure for the "single" dataframe and extract the dataframes from the list of 2?
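You can use c, but you have to wrap your data.frame a in a list first. A data.frame is itself a list of columns, so c() otherwise concatenates its columns with the elements of b.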
res <- c(b, list(a=a))
str(res)
#List of 3
# $ c:'data.frame': 10 obs. of 1 variable:
# ..$ x: int [1:10] 1 2 3 4 5 6 7 8 9 10
# $ d:'data.frame': 20 obs. of 2 variables:
# ..$ x: int [1:20] 1 2 3 4 5 6 7 8 9 10 ...
# ..$ y: int [1:20] 11 12 13 14 15 16 17 18 19 20 ...
# $ a:'data.frame': 10 obs. of 2 variables:
# ..$ xa: int [1:10] 1 2 3 4 5 6 7 8 9 10
# ..$ ya: int [1:10] 11 12 13 14 15 16 17 18 19 20
You can always add it as a new element:
b[["a"]]=a
Because the name "a" is just a character string, it can be generated in a loop or something similar.
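As a rough illustration of adding elements by name in a loop, here is a minimal sketch. The list extra and the data frame e are hypothetical additions for the example; only a and b come from the question.

```r
# The list from the question
b <- list(c = data.frame(x = 1:10),
          d = data.frame(x = 1:20,
                         y = 11:30))

# Hypothetical collection of further data frames to append, keyed by name
extra <- list(a = data.frame(xa = 1:10, ya = 11:20),
              e = data.frame(z = 1:3))

# Add each data frame to b under its own name
for (nm in names(extra)) {
  b[[nm]] <- extra[[nm]]
}

str(b, max.level = 1)
```

When the extra data frames are already held in a list like this, c(b, extra) should give the same result without a loop.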
I'm trying to do the following:
I have a .csv file with N rows and 2 columns that I need to import and convert to a list.
Example file from .csv:
[Image: first seven rows of data]
I import with command: points <- read.csv("points.csv")
'data.frame': 42 obs. of 2 variables:
$ Firefly : int 0 1 0 1 0 1 0 1 0 1 ...
$ Hawkes_times: Factor w/ 42 levels "[ 0.03485687 0.20167375 0.20275073
I need it as a sorted "List of 2" (one for each Firefly) with the following structure:
> str(points)
List of 2
$ : num [1:33] 0.79 0.87 0.88 0.89 0.94 1.01 1.13 1.19 ...
$ : num [1:14] 0.00 0.10 0.56 0.67 1.27 1.31 1.37 1.42 ...
where the first element corresponds to Firefly == 0 and the second element to Firefly == 1.
I attempt the following:
fy0 <- subset(points,Firefly == 0)
fy1 <- subset(points,Firefly == 1)
points.list <- list(fy0,fy1)
> str(points.list)
List of 2
$ :'data.frame': 21 obs. of 2 variables:
..$ Firefly : int [1:21] 0 0 0 0 0 0 0 0 0 0 ...
..$ Hawkes_times: Factor w/ 42 levels "[ 0.03485687 0.20167375 0.20275073 0.20941455 0.40515277 0.47026309\n 0.55714817 0.64789982 0.70749241 "| __truncated__,..: 30 29 28 31 39 40 33 37 25 24 ...
$ :'data.frame': 21 obs. of 2 variables:
..$ Firefly : int [1:21] 1 1 1 1 1 1 1 1 1 1 ...
..$ Hawkes_times: Factor w/ 42 levels "[ 0.03485687 0.20167375 0.20275073 0.20941455 0.40515277 0.47026309\n 0.55714817 0.64789982 0.70749241 "| __truncated__,..: 26 32 21 23 20 41 34 22 27 36 ...
I think I need an as.numeric(fy0$Hawkes_times) somewhere, but I want to avoid loops since I will have hundreds of rows and n Firefly values (fy0, fy1, fy2, ... fyn).
Thank you!
-Richard
points <- data.frame(firefly=rep(0:1, times=10), times=1:20)
split(points$times, points$firefly)
# $`0`
# [1] 1 3 5 7 9 11 13 15 17 19
# $`1`
# [1] 2 4 6 8 10 12 14 16 18 20
This does not rely on equally-sized groups:
set.seed(42)
points <- data.frame(firefly=sample(0:1, size=20, replace=TRUE), times=1:20)
split(points$times, points$firefly)
# $`0`
# [1] 3 8 11 14 15 18 19
# $`1`
# [1] 1 2 4 5 6 7 9 10 12 13 16 17 20
As you can see, the original order within each group is preserved.
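The split() examples above use plain numeric times, whereas the str() output in the question suggests each Hawkes_times entry is a bracketed string of whitespace-separated numbers. A minimal sketch for that case follows, assuming exactly that format; the example rows and the helper parse_times are made up for illustration, and sorting the pooled times per Firefly is my reading of "sorted" in the question.

```r
# Hypothetical stand-in for the imported data: one bracketed string per row
points <- data.frame(
  Firefly = c(0, 1, 0),
  Hawkes_times = c("[ 0.79 0.87\n 0.88 ]", "[ 0.10 0.00 ]", "[ 1.01 0.94 ]"),
  stringsAsFactors = FALSE
)

# Turn one bracketed string into a numeric vector
parse_times <- function(s) {
  s <- gsub("\\[|\\]", " ", as.character(s))          # drop the brackets
  as.numeric(strsplit(trimws(s), "[[:space:]]+")[[1]])  # split on any whitespace
}

# Parse every row, group the parsed vectors by Firefly, then pool and sort
times <- lapply(points$Hawkes_times, parse_times)
points.list <- lapply(split(times, points$Firefly),
                      function(x) sort(unlist(x)))

str(points.list)
```

The as.character() call is there so the same helper also works if the column was read in as a factor.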
I did some research on this but only found information on reading in multiple CSV files.
I'm trying to create a widget where I can read in a CSV file containing several data sets and print as many graphs as there are data sets.
I was trying to think of a way to read in a CSV in which multiple data sets are stacked vertically. However, I won't know the length of each data set, and I won't know how many data sets are present.
Any ideas or concepts to consider would be appreciated.
# Create sample data
unlink("so-data.csv") # remove it if it exists
set.seed(1492) # reproducible
# make 3 data frames of different lengths
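frames <- lapply(c(3, 10, 5), function(n) {
  data.frame(X = runif(n), Y1 = runif(n), Y2= runif(n))
})
# write them to single file preserving the header
# (append=TRUE with column names triggers a warning on every call, hence suppressWarnings;
#  default quoting is kept so the header lines are written quoted)
suppressWarnings(
  invisible(
    lapply(frames, write.table, file="so-data.csv", sep=",",
           append=TRUE, row.names=FALSE)
  )
)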
That file looks like:
"X","Y1","Y2"
0.277646409813315,0.110495456494391,0.852662623859942
0.21606229362078,0.0521760624833405,0.510357670951635
0.184417578391731,0.00824321852996945,0.390395383816212
"X","Y1","Y2"
0.769067857181653,0.916519832098857,0.971386880846694
0.6415081594605,0.63678711745888,0.148033464793116
0.638599780155346,0.381162445060909,0.989824152784422
0.194932354846969,0.132614633999765,0.845784503268078
0.522090089507401,0.599085820373148,0.218151196138933
0.521618122234941,0.0903550288639963,0.983936473494396
0.792095972690731,0.932019826257601,0.703315682942048
0.12338977586478,0.584303047973663,0.421113619813696
0.343668724410236,0.561827397439629,0.111441049026325
0.660837838426232,0.345943035557866,0.0270762923173606
"X","Y1","Y2"
0.309987690066919,0.441982284653932,0.133840701542795
0.747786369873211,0.240106994053349,0.62044994905591
0.789473889162764,0.853503877297044,0.150850139558315
0.165826949058101,0.119402598123997,0.318282842403278
0.39083837531507,0.109747459646314,0.876092307968065
Now you can do:
# read in the data as lines
l <- readLines("so-data.csv")
# figure out where the individual data sets are
# (the header lines are the only lines containing "X")
starts <- grep("X", l)
# each data set ends just before the next header; this also works for a single data set
ends <- c(starts[-1] - 1, length(l))
# read them in
new_frames <- mapply(function(start, end) {
  read.csv(text=paste0(l[start:end], collapse="\n"), header=TRUE)
}, starts, ends, SIMPLIFY=FALSE)
str(new_frames)
## List of 3
## $ :'data.frame': 3 obs. of 3 variables:
## ..$ X : num [1:3] 0.278 0.216 0.184
## ..$ Y1: num [1:3] 0.1105 0.05218 0.00824
## ..$ Y2: num [1:3] 0.853 0.51 0.39
## $ :'data.frame': 10 obs. of 3 variables:
## ..$ X : num [1:10] 0.769 0.642 0.639 0.195 0.522 ...
## ..$ Y1: num [1:10] 0.917 0.637 0.381 0.133 0.599 ...
## ..$ Y2: num [1:10] 0.971 0.148 0.99 0.846 0.218 ...
## $ :'data.frame': 5 obs. of 3 variables:
## ..$ X : num [1:5] 0.31 0.748 0.789 0.166 0.391
## ..$ Y1: num [1:5] 0.442 0.24 0.854 0.119 0.11
## ..$ Y2: num [1:5] 0.134 0.62 0.151 0.318 0.876
As @Oriol Mirosa mentioned in the comments, this is one way you can do it. You can first read the whole file:
df = read.csv("path", header = TRUE)
Assuming below is how the whole csv file is structured:
df = data.frame(X=c(1:10, "X", 1:20, "X", 1:30),
Y=c(1:10, "Y", 1:20, "Y", 1:30),
Z=c(1:10, "Z", 1:20, "Z", 1:30))
df$newset = ifelse(df$X == "X", 1, 0)
df$newset = as.factor(cumsum(df$newset))
dfs = split(df, df$newset)
dfs[-1] = lapply(dfs[-1], function(x) x[-1,-ncol(x)])
dfs[[1]] = dfs[[1]][,-ncol(dfs[[1]])]
I created a binary variable newset indicating whether a row is a repeated "header" row, then used cumsum to give each "dataset" a unique number. I then split() on newset to create a list with one dataset per element. Finally, I dropped the leading header row from every dataset after the first and removed the helper column newset from all of them. This should work no matter the length of each dataset. Note that because the header rows are read in as data, the columns come through as character (or factor) rather than numeric.
Result:
# $`0`
# X Y Z
# 1 1 1 1
# 2 2 2 2
# 3 3 3 3
# 4 4 4 4
# 5 5 5 5
# 6 6 6 6
# 7 7 7 7
# 8 8 8 8
# 9 9 9 9
# 10 10 10 10
#
# $`1`
# X Y Z
# 12 1 1 1
# 13 2 2 2
# 14 3 3 3
# 15 4 4 4
# 16 5 5 5
# 17 6 6 6
# 18 7 7 7
# 19 8 8 8
# 20 9 9 9
# 21 10 10 10
# 22 11 11 11
# 23 12 12 12
# 24 13 13 13
# 25 14 14 14
# 26 15 15 15
# 27 16 16 16
# 28 17 17 17
# 29 18 18 18
# 30 19 19 19
# 31 20 20 20
#
# $`2`
# X Y Z
# 33 1 1 1
# 34 2 2 2
# 35 3 3 3
# 36 4 4 4
# 37 5 5 5
# 38 6 6 6
# 39 7 7 7
# 40 8 8 8
# 41 9 9 9
# 42 10 10 10
# 43 11 11 11
# 44 12 12 12
# 45 13 13 13
# 46 14 14 14
# 47 15 15 15
# 48 16 16 16
# 49 17 17 17
# 50 18 18 18
# 51 19 19 19
# 52 20 20 20
# 53 21 21 21
# 54 22 22 22
# 55 23 23 23
# 56 24 24 24
# 57 25 25 25
# 58 26 26 26
# 59 27 27 27
# 60 28 28 28
# 61 29 29 29
# 62 30 30 30
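Since the repeated header rows force every column to be read as text, the pieces usually need converting back to numbers before plotting. The following is a sketch of one way to do that under the same assumed file layout; it uses a grouping vector grp instead of the newset column, which is my variation rather than the answer's code.

```r
# Same assumed structure as above: header rows repeated inside the data
df <- data.frame(X = c(1:10, "X", 1:20, "X", 1:30),
                 Y = c(1:10, "Y", 1:20, "Y", 1:30),
                 Z = c(1:10, "Z", 1:20, "Z", 1:30),
                 stringsAsFactors = FALSE)

# Running count of header rows seen so far identifies each data set
grp <- cumsum(df$X == "X")
dfs <- split(df, grp)

# Drop the header rows and convert every column to numeric
dfs <- lapply(dfs, function(x) {
  x <- x[x$X != "X", , drop = FALSE]
  x[] <- lapply(x, function(col) as.numeric(as.character(col)))
  x
})

str(dfs, max.level = 1)
```

The as.character() step keeps the conversion safe if the columns were read as factors, as older R versions do by default.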
I've written a simulation function in R. I'd like to do num simulations. Rather than using a for loop, I'm trying to use some sort of apply function, such as lapply or parallel::mclapply.
lapply, as I'm currently using it, is failing.
For example:
# t1() is a generic example function
t1 <- function() {data(cars); return(get("cars"))}
a <- t1() # works
a2 <- vector("list", 5) # pre-allocate list for 5 simulations
# otherwise: a2 <- vector("list", num) # where num was pre-specified
a2 <- lapply(a2, t1)
## Error in FUN(X[[1L]], ...) : unused argument (X[[1]])
What am I doing wrong? Thanks in advance!
I'd rather not need to do:
a2 <- vector("list", 5)
for (i in 1:5) {
a2[[i]] <- t1()
}
It's true that a <- t1() works, but a <- t1(2) would not have worked. lapply calls the function with each element of the list as its first argument, so it is effectively calling t1(a2[[1]]), and t1 accepts no arguments; that is what the "unused argument" error is telling you. Put a dummy parameter in the function's argument list and all will be fine. You might also look at the replicate function, which is specifically designed to support simulation work and does not require a dummy parameter.
> t1 <- function(z) {data(cars); return(get("cars"))}
> a <- t1() # works
> a2 <- vector("list", 5) # pre-allocate list for 5 simulations
> # otherwise: a2 <- vector("list", num) # where num was pre-specified
> a2 <- lapply(a2, t1) ;str(a2)
List of 5
$ :'data.frame': 50 obs. of 2 variables:
..$ speed: num [1:50] 4 4 7 7 8 9 10 10 10 11 ...
..$ dist : num [1:50] 2 10 4 22 16 10 18 26 34 17 ...
$ :'data.frame': 50 obs. of 2 variables:
..$ speed: num [1:50] 4 4 7 7 8 9 10 10 10 11 ...
..$ dist : num [1:50] 2 10 4 22 16 10 18 26 34 17 ...
$ :'data.frame': 50 obs. of 2 variables:
..$ speed: num [1:50] 4 4 7 7 8 9 10 10 10 11 ...
..$ dist : num [1:50] 2 10 4 22 16 10 18 26 34 17 ...
$ :'data.frame': 50 obs. of 2 variables:
..$ speed: num [1:50] 4 4 7 7 8 9 10 10 10 11 ...
..$ dist : num [1:50] 2 10 4 22 16 10 18 26 34 17 ...
$ :'data.frame': 50 obs. of 2 variables:
..$ speed: num [1:50] 4 4 7 7 8 9 10 10 10 11 ...
..$ dist : num [1:50] 2 10 4 22 16 10 18 26 34 17 ...
>
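For comparison, here is a minimal sketch of the replicate route, plus an lapply variant that wraps the call in an anonymous function so the simulation function itself needs no dummy parameter. The function sim_once is a made-up stand-in for the real simulation.

```r
# Hypothetical zero-argument simulation function
sim_once <- function() data.frame(x = rnorm(3))

num <- 5

# replicate evaluates the expression num times; simplify = FALSE keeps a list
sims1 <- replicate(num, sim_once(), simplify = FALSE)

# lapply over an index; the anonymous function ignores its argument
sims2 <- lapply(seq_len(num), function(i) sim_once())

length(sims1)
length(sims2)
```

The anonymous-function wrapper should carry over to parallel::mclapply in the same way, since it takes its inputs in the same order as lapply.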
I have a matrix with 500 rows and 1000 columns. Each cell holds comma-separated values (three in the sample below), and I need to split them into separate columns.
The data looks like this:
1 2 3 4 ... 1000
1 12,1,20 14,15,12 10,10,20 1,0,10 ... 1,5,3
2 12,1,20 14,15,12 10,10,20 1,0,10 ... 1,5,3
3 12,1,20 14,15,12 10,10,20 1,0,10 ... 1,5,3
.
.
500 12,1,20 14,15,12 10,10,20 1,0,10 ... 1,5,3
My code is
mat=matrix(data=NA, nrow=257, ncol=3)
n=1000
k=500
for(i in 1:n){
mat[i]<-colsplit(as.character(data[,i]), "," , c("a","b","c"))
}
This is not working; something is missing in my loop.
Could anyone help me figure it out? Thanks.
If you want to create new columns using "," as the delimiter:
library(data.table)
library(splitstackshape)
df1 <- cSplit(df, 1:ncol(df), sep=",")[,lapply(.SD, as.numeric)]
df1
# X1_1 X1_2 X1_3 X2_1 X2_2 X2_3 X3_1 X3_2 X3_3 X4_1 X4_2 X4_3
#1: 12 1 20 14 15 12 10 10 20 1 0 10
#2: 12 1 20 14 15 12 10 10 20 1 0 10
#3: 12 1 20 14 15 12 10 10 20 1 0 10
Or use cSplit_f, which should be faster for rectangular data (based on comments from the author of the splitstackshape package, @Ananda Mahto):
cSplit_f(df, 1:ncol(df), sep=",")[,lapply(.SD, as.numeric)]
str(df1)
# Classes ‘data.table’ and 'data.frame': 3 obs. of 12 variables:
# $ X1_1: num 12 12 12
# $ X1_2: num 1 1 1
# $ X1_3: num 20 20 20
# $ X2_1: num 14 14 14
# $ X2_2: num 15 15 15
# $ X2_3: num 12 12 12
# $ X3_1: num 10 10 10
# $ X3_2: num 10 10 10
# $ X3_3: num 20 20 20
# $ X4_1: num 1 1 1
# $ X4_2: num 0 0 0
# $ X4_3: num 10 10 10
data
df <- structure(list(X1 = c("12,1,20", "12,1,20", "12,1,20"), X2 = c("14,15,12",
"14,15,12", "14,15,12"), X3 = c("10,10,20", "10,10,20", "10,10,20"
), X4 = c("1,0,10", "1,0,10", "1,0,10")), .Names = c("X1", "X2",
"X3", "X4"), class = "data.frame", row.names = c("1", "2", "3"
))
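If installing splitstackshape is not an option, the same split can be sketched in base R. This is an alternative I am adding, not part of the answer above; the helper split_col is hypothetical, and it assumes every cell in a column contains the same number of comma-separated values.

```r
# Same example data as above
df <- data.frame(X1 = c("12,1,20", "12,1,20", "12,1,20"),
                 X2 = c("14,15,12", "14,15,12", "14,15,12"),
                 X3 = c("10,10,20", "10,10,20", "10,10,20"),
                 X4 = c("1,0,10", "1,0,10", "1,0,10"),
                 stringsAsFactors = FALSE)

# Hypothetical helper: split one column into a numeric matrix,
# naming the pieces <column>_1, <column>_2, ...
split_col <- function(col, name) {
  parts <- do.call(rbind, strsplit(as.character(col), ",", fixed = TRUE))
  storage.mode(parts) <- "double"
  colnames(parts) <- paste(name, seq_len(ncol(parts)), sep = "_")
  parts
}

# Apply the helper to every column and bind the pieces side by side
res <- do.call(cbind, unname(Map(split_col, df, names(df))))
res
```

The result is a plain numeric matrix rather than a data.table; wrap it in as.data.frame() if a data frame is needed.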