How to read in a CSV file with multiple data sets? - r

I did some research on this but only found information on reading in multiple CSV files.
I'm trying to create a widget where I can read in a single CSV file containing several data sets and print as many graphs as there are data sets.
The data sets are stacked vertically in the file. However, I won't know the length of each data set or how many data sets are present.
Any ideas or concepts to consider would be appreciated.

# Create sample data
unlink("so-data.csv") # remove it if it exists
set.seed(1492) # reproducible
# make 3 data frames of different lengths
frames <- lapply(c(3, 10, 5), function(n) {
data.frame(X = runif(n), Y1 = runif(n), Y2= runif(n))
})
# write them to single file preserving the header
suppressWarnings(
invisible(
lapply(frames, write.table, file="so-data.csv", sep=",", quote=FALSE,
append=TRUE, row.names=FALSE)
)
)
That file looks like:
"X","Y1","Y2"
0.277646409813315,0.110495456494391,0.852662623859942
0.21606229362078,0.0521760624833405,0.510357670951635
0.184417578391731,0.00824321852996945,0.390395383816212
"X","Y1","Y2"
0.769067857181653,0.916519832098857,0.971386880846694
0.6415081594605,0.63678711745888,0.148033464793116
0.638599780155346,0.381162445060909,0.989824152784422
0.194932354846969,0.132614633999765,0.845784503268078
0.522090089507401,0.599085820373148,0.218151196138933
0.521618122234941,0.0903550288639963,0.983936473494396
0.792095972690731,0.932019826257601,0.703315682942048
0.12338977586478,0.584303047973663,0.421113619813696
0.343668724410236,0.561827397439629,0.111441049026325
0.660837838426232,0.345943035557866,0.0270762923173606
"X","Y1","Y2"
0.309987690066919,0.441982284653932,0.133840701542795
0.747786369873211,0.240106994053349,0.62044994905591
0.789473889162764,0.853503877297044,0.150850139558315
0.165826949058101,0.119402598123997,0.318282842403278
0.39083837531507,0.109747459646314,0.876092307968065
Now you can do:
# read in the data as lines
l <- readLines("so-data.csv")
# figure out where the individual data sets are
starts <- grep("X", l) # header lines contain the column name X
ends <- c(starts[-1] - 1, length(l)) # starts[-1] also works when there is only one data set
# read them in
new_frames <- mapply(function(start, end) {
read.csv(text=paste0(l[start:end], collapse="\n"), header=TRUE)
}, starts, ends, SIMPLIFY=FALSE)
str(new_frames)
## List of 3
## $ :'data.frame': 3 obs. of 3 variables:
## ..$ X : num [1:3] 0.278 0.216 0.184
## ..$ Y1: num [1:3] 0.1105 0.05218 0.00824
## ..$ Y2: num [1:3] 0.853 0.51 0.39
## $ :'data.frame': 10 obs. of 3 variables:
## ..$ X : num [1:10] 0.769 0.642 0.639 0.195 0.522 ...
## ..$ Y1: num [1:10] 0.917 0.637 0.381 0.133 0.599 ...
## ..$ Y2: num [1:10] 0.971 0.148 0.99 0.846 0.218 ...
## $ :'data.frame': 5 obs. of 3 variables:
## ..$ X : num [1:5] 0.31 0.748 0.789 0.166 0.391
## ..$ Y1: num [1:5] 0.442 0.24 0.854 0.119 0.11
## ..$ Y2: num [1:5] 0.134 0.62 0.151 0.318 0.876
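The same steps can be wrapped in a small helper. This is only a sketch: the name read_stacked_csv is made up for illustration, and it assumes every data set repeats exactly the same header line as the first line of the file.

```r
# Hypothetical helper: split a file of vertically stacked CSV blocks
# into a list of data frames. Assumes each block starts with a header
# line identical to the first line of the file.
read_stacked_csv <- function(path) {
  l <- readLines(path)
  starts <- which(l == l[1])            # every repeat of the header line
  ends <- c(starts[-1] - 1, length(l))  # also correct when there is one block
  Map(function(start, end) {
    read.csv(text = paste(l[start:end], collapse = "\n"), header = TRUE)
  }, starts, ends)
}

# Demo on a temporary file with two blocks of different lengths
tmp <- tempfile(fileext = ".csv")
writeLines(c("X,Y1,Y2", "1,2,3", "4,5,6", "X,Y1,Y2", "7,8,9"), tmp)
frames_back <- read_stacked_csv(tmp)
sapply(frames_back, nrow)
```

From there, drawing one graph per data set is a matter of looping (or using lapply) over the returned list.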

As @Oriol Mirosa mentioned in the comments, this is one way you can do it. You can first read the whole file:
df = read.csv("path", header = TRUE)
Assuming below is how the whole csv file is structured:
df = data.frame(X=c(1:10, "X", 1:20, "X", 1:30),
Y=c(1:10, "Y", 1:20, "Y", 1:30),
Z=c(1:10, "Z", 1:20, "Z", 1:30))
df$newset = ifelse(df$X == "X", 1, 0)
df$newset = as.factor(cumsum(df$newset))
dfs = split(df, df$newset)
dfs[-1] = lapply(dfs[-1], function(x) x[-1,-ncol(x)])
dfs[[1]] = dfs[[1]][,-ncol(dfs[[1]])]
I created a binary variable newset indicating whether a row is a "header". Then I used cumsum to give each data set a unique number, and split() on newset to create a list with one data set per element. Finally, I removed the repeated header row from every data set after the first, and dropped the helper column newset from all of them. This should work no matter the length of each data set.
Result:
# $`0`
# X Y Z
# 1 1 1 1
# 2 2 2 2
# 3 3 3 3
# 4 4 4 4
# 5 5 5 5
# 6 6 6 6
# 7 7 7 7
# 8 8 8 8
# 9 9 9 9
# 10 10 10 10
#
# $`1`
# X Y Z
# 12 1 1 1
# 13 2 2 2
# 14 3 3 3
# 15 4 4 4
# 16 5 5 5
# 17 6 6 6
# 18 7 7 7
# 19 8 8 8
# 20 9 9 9
# 21 10 10 10
# 22 11 11 11
# 23 12 12 12
# 24 13 13 13
# 25 14 14 14
# 26 15 15 15
# 27 16 16 16
# 28 17 17 17
# 29 18 18 18
# 30 19 19 19
# 31 20 20 20
#
# $`2`
# X Y Z
# 33 1 1 1
# 34 2 2 2
# 35 3 3 3
# 36 4 4 4
# 37 5 5 5
# 38 6 6 6
# 39 7 7 7
# 40 8 8 8
# 41 9 9 9
# 42 10 10 10
# 43 11 11 11
# 44 12 12 12
# 45 13 13 13
# 46 14 14 14
# 47 15 15 15
# 48 16 16 16
# 49 17 17 17
# 50 18 18 18
# 51 19 19 19
# 52 20 20 20
# 53 21 21 21
# 54 22 22 22
# 55 23 23 23
# 56 24 24 24
# 57 25 25 25
# 58 26 26 26
# 59 27 27 27
# 60 28 28 28
# 61 29 29 29
# 62 30 30 30
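One thing to watch with this approach: because header text is mixed into the data, the columns of the split frames are not numeric. A minimal sketch of the same cumsum/split idea with a type-conversion step added, using invented toy data and assuming the header rows can be recognised by the value "X" in column X:

```r
# Toy version of the stacked file after a plain read: header text is
# mixed into the data, so every column is character.
df <- data.frame(X = c(1:3, "X", 1:2),
                 Y = c(4:6, "Y", 7:8),
                 stringsAsFactors = FALSE)

# A new data set starts at each repeated header row
grp <- cumsum(df$X == "X")
dfs <- lapply(split(df, grp), function(d) {
  d <- d[d$X != "X", , drop = FALSE]            # drop the repeated header row
  d[] <- lapply(d, type.convert, as.is = TRUE)  # character -> integer/numeric
  rownames(d) <- NULL                           # renumber rows within each set
  d
})
str(dfs)
```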

Related

Importing CSV of arrays as list

I'm trying to do the following:
I have a .csv file with N rows and 2 columns that I need to import and convert to a list.
Example file from .csv:
First seven rows of data
I import with command: points <- read.csv("points.csv")
'data.frame': 42 obs. of 2 variables:
$ Firefly : int 0 1 0 1 0 1 0 1 0 1 ...
$ Hawkes_times: Factor w/ 42 levels "[ 0.03485687 0.20167375 0.20275073
I need it as a sorted "List of 2" (one for each Firefly) with the following structure:
> str(points)
List of 2
$ : num [1:33] 0.79 0.87 0.88 0.89 0.94 1.01 1.13 1.19 ...
$ : num [1:14] 0.00 0.10 0.56 0.67 1.27 1.31 1.37 1.42 ...
where the first element represents Firefly == 0 and the second element represents Firefly == 1.
I attempt the following:
fy0 <- subset(points,Firefly == 0)
fy1 <- subset(points,Firefly == 1)
points.list <- list(fy0,fy1)
> str(points.list)
List of 2
$ :'data.frame': 21 obs. of 2 variables:
..$ Firefly : int [1:21] 0 0 0 0 0 0 0 0 0 0 ...
..$ Hawkes_times: Factor w/ 42 levels "[ 0.03485687 0.20167375 0.20275073 0.20941455 0.40515277 0.47026309\n 0.55714817 0.64789982 0.70749241 "| __truncated__,..: 30 29 28 31 39 40 33 37 25 24 ...
$ :'data.frame': 21 obs. of 2 variables:
..$ Firefly : int [1:21] 1 1 1 1 1 1 1 1 1 1 ...
..$ Hawkes_times: Factor w/ 42 levels "[ 0.03485687 0.20167375 0.20275073 0.20941455 0.40515277 0.47026309\n 0.55714817 0.64789982 0.70749241 "| __truncated__,..: 26 32 21 23 20 41 34 22 27 36 ...
I think I need an as.numeric(fy0$Hawkes_times) somewhere, but I want to avoid loops since I will have hundreds of rows and n Firefly values (fy0, fy1, fy2, ... fyn).
Thank you!
-Richard
points <- data.frame(firefly=rep(0:1, times=10), times=1:20)
split(points$times, points$firefly)
# $`0`
# [1] 1 3 5 7 9 11 13 15 17 19
# $`1`
# [1] 2 4 6 8 10 12 14 16 18 20
This does not rely on equally-sized groups:
set.seed(42)
points <- data.frame(firefly=sample(0:1, size=20, replace=TRUE), times=1:20)
split(points$times, points$firefly)
# $`0`
# [1] 3 8 11 14 15 18 19
# $`1`
# [1] 1 2 4 5 6 7 9 10 12 13 16 17 20
and as you can see the order is preserved.
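The split() call above groups values that are already numeric. In the question, each Hawkes_times entry is a bracketed string, so it has to be parsed first. A hedged sketch, assuming each entry is a whitespace-separated list of numbers inside square brackets (the values and the helper name parse_times are invented):

```r
# Assumed input shape: one bracketed, whitespace-separated string of
# numbers per row (possibly containing line breaks), as in the str()
# output of the question. The values below are invented.
points <- data.frame(
  Firefly = c(0, 1, 0, 1),
  Hawkes_times = c("[ 0.79 0.94\n 0.87 ]", "[ 0.10 0.00 ]",
                   "[ 1.01 0.88 ]", "[ 0.67\n 0.56 ]"),
  stringsAsFactors = FALSE
)

# Turn one bracketed string into a numeric vector
parse_times <- function(s) {
  s <- trimws(gsub("\\[|\\]", "", s))   # strip the square brackets
  as.numeric(strsplit(s, "\\s+")[[1]])  # split on any run of whitespace
}

# Group the strings by Firefly, parse each, and sort within the group
points.list <- lapply(
  split(as.character(points$Hawkes_times), points$Firefly),
  function(v) sort(unlist(lapply(v, parse_times)))
)
str(points.list)
```

The as.character() call matters when the column was read in as a factor, as in the question.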

What's wrong with this for loop to calculate the mean of each row?

I am trying to:
- calculate the mean of each row across columns 2 to 11 of my data frame "alpha"
- store the result in column 12 of "alpha", which currently holds NA values
Column 1 is "locs".
my df:
[,1][,2][,3][,4][,5][,6][,7][,8][,9][,10][,11][,12]...[,17]
[1,] A1 5 9 4 8 12 4 8 12 4 8 NA NA
[2,] C3 6 10 4 8 12 4 8 12 4 8 NA NA
[3,] P2 7 11 5 6 10 5 6 10 5 6 NA NA
[4,] 4 8 12 5 6 10 5 6 10 5 6 NA NA
[49,] 4 8 12 5 6 10 5 6 10 5 6 NA NA
I am not very familiar with R and I don't understand the problem.
Those are the two different for loops I tried and the warning message:
> for (j in 1:49){
+ alpha[j, 12] <- mean(alpha[j,2:11])
+ }
There were 49 warnings (use warnings() to see them)
>
> for (j in 1:length(locs)) {
+ alpha$mean[j] <- mean(alpha[j,2:11])
+ }
There were 49 warnings (use warnings() to see them)
>
> warnings()
Warning messages:
1: In mean.default(alpha[j, 2:11]) :
argument is not numeric or logical: returning NA
2: In mean.default(alpha[j, 2:11]) :
argument is not numeric or logical: returning NA
'data.frame': 49 obs. of 17 variables:
$ locs: Factor w/ 49 levels "A1","C3",..: 1 2 3 4 5 6 7 8 9 10 ...
$ sum.2009 : num 12 11 12 15 22 18 14 18 8 9 ...
$ sum.2010 : num 14 11 13 18 22 21 15 21 16 17 ...
$ sum.2011 : num 15 12 20 18 26 25 22 18 25 14 ...
$ sum.2012 : num 15 13 17 25 24 20 24 28 26 20 ...
$ sum.2013 : num 14 9 21 21 28 20 14 19 23 21 ...
$ sum.2014 : num 21 16 28 24 32 26 19 22 7 12 ...
$ sum.2015 : num 27 27 31 23 17 6 14 26 19 19 ...
$ sum.2016 : num 18 18 14 23 25 22 24 39 32 15 ...
$ sum.2017 : num 18 18 23 35 22 7 12 27 15 16 ...
$ sum.2018 : num 25 23 25 26 20 11 12 13 7 8 ...
$ mean : num NA NA NA NA NA NA NA NA NA NA ...
Then I converted "locs" from factor to numeric using:
alpha$locs <- as.numeric(alpha$locs)
alpha$locs <- lapply(alpha$locs , as.numeric)
which both worked but I still got the same error messages after running
the for loops.
alpha[1, 2:11] is a data frame with one row, not a vector, and mean doesn't know what to do with a data frame. A better approach would be alpha[, 12] = rowMeans(alpha[, 2:11])
Your approach would work just fine if alpha was a matrix - matrices can only have one data type, so a row or a column can be converted to a vector always. But data frames are all about columns, and columns can have different types. alpha[2:11, 1] is a vector because it's all from one column, and each column is a vector, so it's just part of a vector. But alpha[1, 2:11] spans several columns, and each of the columns might have a different type, so R keeps it as a data frame.
Another approach you could take would be to unlist each row to convert it to a vector, alpha[j, 12] <- mean(unlist(alpha[j,2:11])). This will work, but it will be very slow compared to the rowMeans approach.
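To make the difference concrete, here is a small sketch on an invented stand-in for alpha (three rows, three numeric columns instead of ten), comparing rowMeans with the unlist loop:

```r
# Invented stand-in for "alpha": one label column plus numeric columns
alpha <- data.frame(locs = c("A1", "C3", "P2"),
                    sum.2009 = c(5, 6, 7),
                    sum.2010 = c(9, 10, 11),
                    sum.2011 = c(4, 4, 5))

# A single row across several columns is still a data frame ...
is.data.frame(alpha[1, 2:4])

# ... so compute all row means in one vectorized call
alpha$mean <- rowMeans(alpha[, 2:4])

# The slower loop gives the same answer once each row is unlisted
loop_mean <- numeric(nrow(alpha))
for (j in seq_len(nrow(alpha))) {
  loop_mean[j] <- mean(unlist(alpha[j, 2:4]))
}
all.equal(alpha$mean, loop_mean)
```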

How to rbind dataframes with identical column names in a list

I have a list like this:
list=list(
df1=read.table(text = "a b c
11 14 20
17 15 12
6 19 17
",header=T),
df2=read.table(text = "a b c
6 19 12
9 7 19
",header=T),
df3=read.table(text = "a d f
12 20 15
12 10 8
7 8 7
",header=T),
df4=read.table(text = "g f e z
5 12 11 5
16 17 20 16
19 9 11 20
",header=T),
df5=read.table(text = "g f e z
15 13 9 18
12 12 17 16
15 9 12 11
15 20 19 15
",header=T),
df6=read.table(text = "a d f
11 7 16
11 12 11
",header=T)
)
My list contains different dataframes. Based on the column names, there are 3 types of dataframe in the list:
type1: df1 and df2
type2: df3 and df6
type3: df4 and df5
I want to rbind the dataframes with identical column names and save the result in a new list. In the example, df1 and df2, df3 and df6, and df4 and df5 have identical column names. I need code that automatically identifies and rbinds dataframes with identical column names.
the following list is expected as result:
> new list
$df1.df2
a b c
1 11 14 20
2 17 15 12
3 6 19 17
4 6 19 12
5 9 7 19
$df3.df6
a d f
1 12 20 15
2 12 10 8
3 7 8 7
4 11 7 16
5 11 12 11
$df4.df5
g f e z
1 5 12 11 5
2 16 17 20 16
3 19 9 11 20
4 15 13 9 18
5 12 12 17 16
6 15 9 12 11
7 15 20 19 15
the name of dataframe in new list could be anything.
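We can bind all the frames together, tag each row by which of its columns are NA, and split on that tag (dfls is the list of dataframes):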
library(tidyverse)
library(janitor)
bind_rows(dfls) %>%
mutate(code= apply(apply(., 2, function(x){
ifelse(is.na(x), 1, 2)}), 1, paste, collapse="")) %>%
nest(.,-code, .key="code") %>%
mutate(filtered = map(code, janitor::remove_empty_cols)) %>%
pull(filtered) -> out
glimpse(out)
# List of 3
# $ :Classes ‘tbl_df’, ‘tbl’ and 'data.frame': 5 obs. of 3 variables:
# ..$ a: int [1:5] 11 17 6 6 9
# ..$ b: int [1:5] 14 15 19 19 7
# ..$ c: int [1:5] 20 12 17 12 19
# $ :Classes ‘tbl_df’, ‘tbl’ and 'data.frame': 5 obs. of 3 variables:
# ..$ a: int [1:5] 12 12 7 11 11
# ..$ d: int [1:5] 20 10 8 7 12
# ..$ f: int [1:5] 15 8 7 16 11
# $ :Classes ‘tbl_df’, ‘tbl’ and 'data.frame': 7 obs. of 4 variables:
# ..$ f: int [1:7] 12 17 9 13 12 9 20
# ..$ g: int [1:7] 5 16 19 15 12 15 15
# ..$ e: int [1:7] 11 20 11 9 17 12 19
# ..$ z: int [1:7] 5 16 20 18 16 11 15
Because I don't like naming a variable list, I'm naming your data as l.
lapply(
split(l, sapply(l, function(a) paste(colnames(a), collapse = "_"))),
dplyr::bind_rows)
# $a_b_c
# a b c
# 1 11 14 20
# 2 17 15 12
# 3 6 19 17
# 4 6 19 12
# 5 9 7 19
# $a_d_f
# a d f
# 1 12 20 15
# 2 12 10 8
# 3 7 8 7
# 4 11 7 16
# 5 11 12 11
# $g_f_e_z
# g f e z
# 1 5 12 11 5
# 2 16 17 20 16
# 3 19 9 11 20
# 4 15 13 9 18
# 5 12 12 17 16
# 6 15 9 12 11
# 7 15 20 19 15
I would generally prefer to use by(data, INDICES, FUN) to lapply(split(data, INDICES), FUN), but for some reason it kept complaining ... so the above.
The choice to concatenate column names collapsing with _ was arbitrary, intending to find a simple "hashing" of them; it's not hard to contrive a situation where this method finds two frames similar when they are not ... perhaps it's unlikely enough to not be a concern.
I should also note that I'm using dplyr::bind_rows, but nothing else from dplyr. This could easily be converted into something using purrr:: or perhaps other tidy-package groupings.
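For completeness, the same grouping can be done in base R alone. This is a sketch on a small invented list, assuming that frames belong together exactly when their column names match in the same order; the "|" separator is as arbitrary as the "_" above.

```r
# Small invented list with two column layouts
l <- list(
  df1 = data.frame(a = 1:2, b = 3:4),
  df2 = data.frame(a = 5L,  b = 6L),
  df3 = data.frame(g = 7:8, f = 9:10)
)

# Key each frame by its column names, then group frames sharing a key
key <- sapply(l, function(d) paste(names(d), collapse = "|"))
groups <- split(l, key)

# Stack each group; unname() keeps rbind from prefixing row names
out <- lapply(groups, function(g) do.call(rbind, unname(g)))

# Label each result after the frames it came from, e.g. "df1.df2"
names(out) <- sapply(groups, function(g) paste(names(g), collapse = "."))
out
```

Naming each element after its source frames mirrors the df1.df2 style shown in the question.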

Merge and fill different length data in R

I'm using R and need to merge data of different lengths.
Given these data sets:
> means2012
# A tibble: 232 x 2
exporter eci
<fct> <dbl>
1 ABW 0.235
2 AFG -0.850
3 AGO -1.40
4 AIA 1.34
5 ALB -0.480
6 AND 1.22
7 ANS 0.662
8 ARE 0.289
9 ARG 0.176
10 ARM 0.490
# ... with 222 more rows
> means2013
# A tibble: 234 x 2
exporter eci
<fct> <dbl>
1 ABW 0.534
2 AFG -0.834
3 AGO -1.26
4 AIA 1.47
5 ALB -0.498
6 AND 1.13
7 ANS 0.616
8 ARE 0.267
9 ARG 0.127
10 ARM 0.0616
# ... with 224 more rows
> str(means2012)
Classes ‘tbl_df’, ‘tbl’ and 'data.frame': 232 obs. of 2 variables:
$ exporter: Factor w/ 242 levels "ABW","AFG","AGO",..: 1 2 3 4 5 6 7 9 10 11 ...
$ eci : num 0.235 -0.85 -1.404 1.337 -0.48 ...
> str(means2013)
Classes ‘tbl_df’, ‘tbl’ and 'data.frame': 234 obs. of 2 variables:
$ exporter: Factor w/ 242 levels "ABW","AFG","AGO",..: 1 2 3 4 5 6 7 9 10 11 ...
$ eci : num 0.534 -0.834 -1.263 1.471 -0.498 ...
Note that the two tibbles have different lengths. "exporter" holds the countries.
Is there any way to merge both tibbles by the factor (exporter) and fill the missing values with NA?
It doesn't matter whether the result is a tibble, a dataframe, or another kind of object.
Like this:
tibble 1
a 5
b 10
c 15
d 25
tibble 2
a 7
c 23
d 20
merged one:
a 5 7
b 10 na
c 15 23
d 25 20
Using merge with the parameter all set to TRUE:
tibble1 <- read.table(text="
x y
a 5
b 10
c 15
d 25",header=TRUE,stringsAsFactors=FALSE)
tibble2 <- read.table(text="
x z
a 7
c 23
d 20",header=TRUE,stringsAsFactors=FALSE)
merge(tibble1,tibble2,all=TRUE)
x y z
1 a 5 7
2 b 10 NA
3 c 15 23
4 d 25 20
Or dplyr::full_join(tibble1,tibble2) for the same effect
You could rename the columns to join them, and get NA where the other value is missing.
library(tidyverse)
means2012 %>%
rename(eci2012 = eci) %>%
full_join(means2013 %>%
rename(eci2013 = eci))
But a tidier approach would be to add a year column, keep the column eci as is and just bind the rows together.
means2012 %>%
mutate(year = 2012) %>%
bind_rows(means2013 %>%
mutate(year = 2013))
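When both inputs share the value column name, as means2012 and means2013 do with eci, base merge can also do the renaming through its suffixes argument. A minimal sketch on invented stand-in data (the real tibbles are much longer):

```r
# Invented stand-ins for the two yearly tables
means2012 <- data.frame(exporter = c("a", "b", "c", "d"),
                        eci = c(5, 10, 15, 25))
means2013 <- data.frame(exporter = c("a", "c", "d"),
                        eci = c(7, 23, 20))

# Full outer join on exporter; the suffixes are appended to the
# clashing column name, giving eci2012 and eci2013
merged <- merge(means2012, means2013, by = "exporter", all = TRUE,
                suffixes = c("2012", "2013"))
merged
```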

Applying IF statement through multiple files in a list

I am attempting to apply an IF statement through a large list of 64 items. My data takes the following form:
file_list Large list (64 elements, 4.2 Mb)
file1: 'data.frame': 3012 obs. of 4 variables:
..$V1: int[1:3012] 1850 1850 1850 ...
..$V2: int[1:3012] 1 2 3 ...
..$V3: int[1:3012] 16 15 16 ...
..$V4: int[1:3012] 4.69E-05 6.99E-05 5.62E-05 ...
................................................................................
file64: 'data.frame': 5412 obs. of 4 variables:
..$V1: int[1:5412] 1850 1850 1850 ...
..$V2: int[1:5412] 1 2 3 ...
..$V3: int[1:5412] 16 15 16 ...
..$V4: int[1:5412] 6.96E-05 4.99E-05 5.37E-05 ...
What I want to do is multiply the fourth column ($V4) in each of the 64 files by a different number depending on the contents of the second column ($V2). The numbers in $V2 are months of the year, and I need to multiply $V4 by 31 when $V2 is 1, 3, 5, 7, 8, 10 and 12; 30 when $V2 is 4, 6, 9 and 11; and 28.25 when $V2 is 2.
I assume this will involve some sort of for loop, but I haven't been able to complete this task. Any suggestions?
Here's a reproducible solution that uses a small function:
file_list <- list(file1 = data.frame(v1 = sample(1:100, 100, TRUE),
v2 = sample(c(1,2,3,5,6,8,10,4,6,9,11), 100, TRUE),
v4 = rnorm(100)),
file2 = data.frame(v1 = sample(1:100, 100, TRUE),
v2 = sample(c(1,2,3,5,6,8,10,4,6,9,11), 100, TRUE),
v4 = rnorm(100)))
str(file_list)
# List of 2
# $ file1:'data.frame': 100 obs. of 3 variables:
# ..$ v1: int [1:100] 6 90 66 86 32 33 50 46 19 59 ...
# ..$ v2: num [1:100] 5 10 2 10 8 6 10 3 5 5 ...
# ..$ v4: num [1:100] -0.639 -2.234 -0.816 0.997 -0.302 ...
# $ file2:'data.frame': 100 obs. of 3 variables:
# ..$ v1: int [1:100] 34 25 24 4 100 59 80 100 21 97 ...
# ..$ v2: num [1:100] 3 6 8 8 9 1 8 1 3 3 ...
# ..$ v4: num [1:100] -2.2599 0.0548 -1.1666 -0.4049 0.4681 ...
myFun <- function(df) {
df$v4[df$v2 %in% c(1,3,5,7,8,10,12)] <- df$v4[df$v2 %in% c(1,3,5,7,8,10,12)] * 31
df$v4[df$v2 %in% c(4,6,9,11)] <- df$v4[df$v2 %in% c(4,6,9,11)] * 30
df$v4[df$v2 == 2] <- df$v4[df$v2 == 2] * 28.25
df
}
lapply(file_list, myFun)
# lapply(file_list, FUN = function(x) head(myFun(x)))
# $file1
# v1 v2 v4
# 1 6 5 -19.816836
# 2 90 10 -69.264329
# 3 66 2 -23.054110
# 4 86 10 30.910798
# 5 32 8 -9.347289
# 6 33 6 -16.316746
#
# $file2
# v1 v2 v4
# 1 34 3 -70.055942
# 2 25 6 1.642744
# 3 24 8 -36.165864
# 4 4 8 -12.550877
# 5 100 9 14.041857
# 6 59 1 -2.556662
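An alternative that avoids the three separate assignments is a lookup vector indexed by month. This is a sketch under the same assumptions as above (columns named v2 and v4, months coded 1 to 12); the names days and scale_by_month are made up for illustration.

```r
# Multiplier for each month 1..12, in calendar order
# (28.25 for February, as specified in the question)
days <- c(31, 28.25, 31, 30, 31, 30, 31, 31, 30, 31, 30, 31)

scale_by_month <- function(df) {
  df$v4 <- df$v4 * days[df$v2]   # look up the multiplier by month number
  df
}

# Tiny invented input so the result is easy to check by eye
file_list <- list(
  file1 = data.frame(v2 = c(1, 2, 4), v4 = c(1, 1, 1)),
  file2 = data.frame(v2 = c(12, 9),   v4 = c(2, 2))
)
scaled <- lapply(file_list, scale_by_month)
scaled$file1$v4
```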
