I would like to take the difference between columns 1 & 2, 3 & 4, 5 & 6, 7 & 8, and so on.
I originally had 55 corresponding column pairs (110 columns total) and needed 55 difference columns. I ended up coding each column difference by hand, but this could surely be done more efficiently, perhaps with arrays in SAS. I would like to solve this problem in R as well.
Synthetic data is below. If anyone knows how to quickly generate sequential paired column names such as var1_apple, var1_banana, var2_apple, var2_banana, var3_apple, var3_banana, ... in R (without typing out a vector of column names), that would be very helpful as well.
Thank you!
## create a dataframe with random values of 1:10. ncols x nrows = 200
df <- data.frame(matrix(sample(1:10, 200, replace = TRUE), ncol = 20, nrow = 10))
EDIT -- added the "55 difference columns" part at the bottom.
Adjusting data to be column pairs:
df <- data.frame(matrix(sample(1:10, 200, replace = TRUE), ncol = 20, nrow = 10))
names(df) <- paste0("var", rep(1:10, each = 2), "_", rep(c("apple", "banana")))
names(df)
[1] "var1_apple" "var1_banana" "var2_apple" "var2_banana" "var3_apple" "var3_banana"
[7] "var4_apple" "var4_banana" "var5_apple" "var5_banana" "var6_apple" "var6_banana"
[13] "var7_apple" "var7_banana" "var8_apple" "var8_banana" "var9_apple" "var9_banana"
[19] "var10_apple" "var10_banana"
library(tidyverse)
df %>%
mutate(row = row_number()) %>%
pivot_longer(-row, names_to = c("var", ".value"), names_sep = "_")
# A tibble: 100 × 4
row var apple banana
<int> <chr> <int> <int>
1 1 var1 8 7
2 1 var2 4 9
3 1 var3 7 3
4 1 var4 6 10
5 1 var5 10 10
6 1 var6 1 1
7 1 var7 2 10
8 1 var8 7 9
9 1 var9 3 8
10 1 var10 2 6
# … with 90 more rows
# ℹ Use `print(n = ...)` to see more rows
Here is a variation that adds all the difference columns, interspersed with the original columns:
df %>%
mutate(row = row_number()) %>%
pivot_longer(-row, names_to = c("var", ".value"), names_sep = "_") %>%
mutate(difference = banana - apple) %>%
pivot_wider(names_from = var, values_from = apple:difference,
names_glue = "{var}_{.value}", names_vary = "slowest")
Result (truncated)
# A tibble: 10 × 10
row var1_apple var1_banana var1_difference var2_apple var2_banana var2_difference var3_apple var3_banana var3_difference
<int> <int> <int> <int> <int> <int> <int> <int> <int> <int>
1 1 7 10 3 5 3 -2 1 9 8
2 2 9 2 -7 3 6 3 8 1 -7
3 3 2 10 8 3 3 0 7 8 1
4 4 3 1 -2 8 3 -5 9 9 0
5 5 2 7 5 7 10 3 6 9 3
6 6 5 4 -1 2 1 -1 5 4 -1
7 7 4 5 1 10 3 -7 9 4 -5
8 8 10 7 -3 3 2 -1 5 9 4
9 9 5 5 0 7 3 -4 10 7 -3
10 10 10 6 -4 1 4 3 10 10 0
Here is a SAS solution. As far as I understand, your data looks like the following. I generated an example with 4 patients and 6 proteins, with two values for each protein (nasal and plasma).
data proteinvalues;
patient=1;
protein1_nas=12; protein1_plas=13; protein2_nas=6; protein2_plas=8;
protein3_nas=23; protein3_plas=24; protein4_nas=15; protein4_plas=15;
protein5_nas=45; protein5_plas=47; protein6_nas=56; protein6_plas=50;
output;
patient=2;
protein1_nas=1; protein1_plas=5; protein2_nas=6; protein2_plas=8;
protein3_nas=2; protein3_plas=4; protein4_nas=7; protein4_plas=9;
protein5_nas=3; protein5_plas=3; protein6_nas=8; protein6_plas=7;
output;
patient=3;
protein1_nas=0; protein1_plas=1; protein2_nas=20; protein2_plas=19;
protein3_nas=33; protein3_plas=5; protein4_nas=19; protein4_plas=20;
protein5_nas=32; protein5_plas=8; protein6_nas=12; protein6_plas=14;
output;
patient=4;
protein1_nas=4; protein1_plas=5; protein2_nas=9; protein2_plas=11;
protein3_nas=4; protein3_plas=6; protein4_nas=78; protein4_plas=70;
protein5_nas=4; protein5_plas=7; protein6_nas=78; protein6_plas=77;
output;
run;
OK, this data-generating step is not very elegant, but it works.
Here is my solution, using an array with all protein variables as members. The value pairs are then adjacent members of the array, i.e. No. 1/No. 2, No. 3/No. 4, etc. Note that this relies on the variables being stored in the dataset in pair order.
/* Number of proteins, in your data 55, in my data 6 */
%let NUM_PROTEINS=6;
data result (keep=patient diff:);
set proteinvalues;
/* Array definition with all variable names starting with "protein" */
array protarr{*} protein:;
/* Array for the resulting values */
array diff(&NUM_PROTEINS.);
/* Initializing the protein number */
num_prot=0;
/* Loop over the protein by step 2 = one iteration per protein */
do n=1 to dim(protarr) by 2;
/* p is plasma, n is nasal value */
p=n+1;
/* setting the protein number */
num_prot+1;
/* calculate the difference, using sum-function to handle missing values */
diff{num_prot}=sum(protarr{p},-protarr{n});
end;
run;
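For readers who also want this in R, the adjacent-pair indexing used in the SAS array can be sketched in base R. This is only an illustrative translation, assuming the columns are already ordered pair by pair; the toy data frame and the names `first_idx`, `second_idx`, and `diffs` are made up for the example.

```r
# Toy data: 3 pairs of columns, ordered pair by pair (an assumption of this sketch)
df <- data.frame(
  var1_apple = c(1, 2, 3),  var1_banana = c(4, 6, 8),
  var2_apple = c(5, 5, 5),  var2_banana = c(1, 2, 3),
  var3_apple = c(0, NA, 2), var3_banana = c(7, 7, 7)
)

# Positions of the first and second member of each pair: 1/2, 3/4, 5/6, ...
first_idx  <- seq(1, ncol(df), by = 2)
second_idx <- first_idx + 1

# Second minus first for every pair at once (like plasma minus nasal above)
diffs <- df[second_idx] - df[first_idx]
names(diffs) <- paste0("diff", seq_along(first_idx))

print(diffs)
```

One difference from the SAS version: plain subtraction in R propagates `NA`, whereas the SAS `sum()` function skips missing values.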
I think @Tom's comment is spot-on. Restructuring the data probably makes sense if you are working with paired data. E.g.:
od <- names(df)[c(TRUE,FALSE)]
ev <- names(df)[c(FALSE,TRUE)]
data.frame(
odd = unlist(df[od]),
oddname = rep(od,each=nrow(df)),
even = unlist(df[ev]),
evenname = rep(ev,each=nrow(df))
)
## odd oddname even evenname
##X11 7 X1 10 X2
##X12 6 X1 1 X2
##X13 2 X1 6 X2
##X14 5 X1 2 X2
##X15 3 X1 1 X2
## ...
It is then trivial to subtract one column from another in this structure.
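As a small illustration of that last point, the subtraction on the restructured data might look like the sketch below. The four-column toy data frame and the object name `long` are assumptions made for the example, not part of the original data.

```r
# Toy stand-in for df: two pairs of columns (X1/X2 and X3/X4)
df <- data.frame(X1 = c(7, 6, 2), X2 = c(10, 1, 6),
                 X3 = c(4, 4, 9), X4 = c(2, 8, 9))

od <- names(df)[c(TRUE, FALSE)]   # odd-positioned columns
ev <- names(df)[c(FALSE, TRUE)]   # even-positioned columns

# One row per (original row, pair), with both members side by side
long <- data.frame(
  odd      = unlist(df[od]),
  oddname  = rep(od, each = nrow(df)),
  even     = unlist(df[ev]),
  evenname = rep(ev, each = nrow(df))
)

# With the pairs aligned, the difference is a single vectorised subtraction
long$diff <- long$odd - long$even
print(long)
```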
If you must have the matrix-like output, then that is also achievable:
od <- names(df)[c(TRUE,FALSE)]
ev <- names(df)[c(FALSE,TRUE)]
setNames(df[od] - df[ev], paste(od, ev, sep="_"))
## X1_X2 X3_X4 X5_X6 X7_X8 X9_X10 X11_X12 X13_X14 X15_X16 X17_X18 X19_X20
##1 -3 2 4 4 -2 4 3 1 -3 9
##2 5 5 4 3 -1 3 -1 -3 5 -2
##3 -4 3 7 4 -5 1 1 5 -4 4
##4 3 0 6 3 4 -5 6 6 -7 4
##5 2 2 1 4 -6 -3 6 2 3 1
##6 -6 -2 4 -2 0 1 3 0 0 -7
##7 0 -6 3 7 -1 0 0 -5 3 1
##8 -1 3 3 1 2 -2 -5 3 0 0
##9 -4 1 -5 -2 -4 7 6 -2 4 -4
##10 2 -7 4 -1 0 -6 -4 -4 0 0
Related
Purpose
Suppose I have four variables: Two variables are original variables and the other two variables are the predictions of the original variables. (In actual data, there are a greater number of original variables)
I want to use a for loop and mutate to create columns that contain the difference between each original variable and its prediction. The sample data and my current approach are as follows:
Sample data
set.seed(10000)
id <- sample(1:20, 100, replace=T)
set.seed(10001)
dv.1 <- sample(1:20, 100, replace=T)
set.seed(10002)
dv.2 <- sample(1:20, 100, replace=T)
set.seed(10003)
pred_dv.1 <- sample(1:20, 100, replace=T)
set.seed(10004)
pred_dv.2 <- sample(1:20, 100, replace=T)
d <-
data.frame(id, dv.1, dv.2, pred_dv.1, pred_dv.2)
Current approach (with Error)
original <- d %>% select(starts_with('dv.')) %>% names(.)
pred <- d %>% select(starts_with('pred_dv.')) %>% names(.)
for (i in 1:length(original)){
d <-
d %>%
mutate(diff = original[i] - pred[i])
l <- length(d)
colnames(d[l]) <- paste0(original[i], '.diff')
}
Error: Problem with mutate() input diff.
x non-numeric argument to binary operator
ℹ Input diff is original[i] - pred[i].
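The error arises because `original[i]` and `pred[i]` are character strings (column names), so `-` is applied to two strings rather than to the columns they name. A minimal base R sketch of a loop that looks the columns up by name is shown below; the four-row data frame is a made-up stand-in for `d`, and building `pred` with `paste0()` assumes every prediction column is named `pred_` plus the original name.

```r
# Small stand-in for d (values chosen for illustration)
d <- data.frame(
  id        = 1:4,
  dv.1      = c(5, 4, 20, 11),
  dv.2      = c(1, 4, 13, 8),
  pred_dv.1 = c(5, 5, 6, 13),
  pred_dv.2 = c(15, 11, 13, 3)
)

original <- grep("^dv\\.", names(d), value = TRUE)
pred     <- paste0("pred_", original)   # assumes the pred_ naming convention

for (i in seq_along(original)) {
  # [[ ]] fetches the column by name; original[i] on its own is just a string
  d[[paste0(original[i], ".diff")]] <- d[[original[i]]] - d[[pred[i]]]
}

print(d)
```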
d %>%
mutate(
across(
.cols = starts_with("dv"),
.fns = ~ . - (get(paste0("pred_",cur_column()))),
.names = "diff_{.col}"
)
)
# A tibble: 100 x 7
id dv.1 dv.2 pred_dv.1 pred_dv.2 diff_dv.1 diff_dv.2
<int> <int> <int> <int> <int> <int> <int>
1 15 5 1 5 15 0 -14
2 13 4 4 5 11 -1 -7
3 12 20 13 6 13 14 0
4 20 11 8 13 3 -2 5
5 9 11 10 7 13 4 -3
6 13 3 3 6 17 -3 -14
7 3 12 19 6 17 6 2
8 19 6 7 11 4 -5 3
9 6 7 12 19 6 -12 6
10 13 10 15 6 7 4 8
# ... with 90 more rows
Subtraction can be applied on dataframes directly.
So you can create a vector of original column names and another vector of prediction column names and subtract them creating new columns.
orig_var <- grep('^dv', names(d), value = TRUE)
pred_var <- grep('pred', names(d), value = TRUE)
d[paste0(orig_var, '.diff')] <- d[orig_var] - d[pred_var]
d
# id dv.1 dv.2 pred_dv.1 pred_dv.2 dv.1.diff dv.2.diff
#1 15 5 1 5 15 0 -14
#2 13 4 4 5 11 -1 -7
#3 12 20 13 6 13 14 0
#4 20 11 8 13 3 -2 5
#5 9 11 10 7 13 4 -3
#...
#...
I have a square matrix of flows (in tibble form, with the same sectors as rows and columns) similar to the example below:
library(tibble)
set.seed(2019)
df1 <- as_tibble(matrix(sample(1:10,100,replace = T), nrow = 10, ncol = 10, byrow = TRUE,
dimnames = list(as.character(1:10),
as.character(1:10))))
df1
# `1` `2` `3` `4` `5` `6` `7` `8` `9` `10`
# <int> <int> <int> <int> <int> <int> <int> <int> <int> <int>
# 1 8 8 4 7 1 1 9 1 2 7
# 2 8 7 3 2 7 7 1 8 4 5
# 3 5 6 10 2 2 1 6 10 7 5
# 4 7 1 9 2 1 1 4 5 1 8
# 5 7 3 9 7 9 5 10 10 3 2
# 6 4 1 1 4 6 4 10 10 1 1
# 7 2 3 8 4 8 10 4 1 9 6
# 8 4 2 4 2 7 10 2 6 4 8
# 9 1 10 10 3 6 2 6 7 8 4
#10 6 8 9 3 6 9 5 10 4 10
I also have a lookup table that shows the broad groups that each flow subgroup fits into:
lookup <- tibble(sector = as.character(1:10),
aggregate_sector = c(rep('A',3), rep('B', 3), rep('C', 4)))
lookup
# sector aggregate_sector
#1 1 A
#2 2 A
#3 3 A
#4 4 B
#5 5 B
#6 6 B
#7 7 C
#8 8 C
#9 9 C
#10 10 C
I want to summarise my original df1 such that it represents the flows between each aggregate_sector (as per the lookup table) rather than each sector. Expected output:
# A B C
#A 59 30 65
#B 42 39 65
#C 67 70 94
My initial attempt has been to convert into a matrix and then use a nested for loop to calculate the sum of flows for each aggregate_sector combination in turn:
mdat <- as.matrix(df1)
# replace row and column names with group names - assumes lookup is in same order as row and col names...
row.names(mdat) <- lookup$aggregate_sector
colnames(mdat) <- lookup$aggregate_sector
# pre-allocate an empty matrix
new_mat <- matrix(nrow = 3, ncol = 3, dimnames = list(LETTERS[1:3], LETTERS[1:3]))
# fill in matrix section by section
for(i in row.names(new_mat)){
for(j in colnames(new_mat)){
new_mat[i,j] <- sum(mdat[which(row.names(mdat) ==i), which(colnames(mdat) ==j)])
}
}
new_mat
# A B C
#A 59 30 65
#B 42 39 65
#C 67 70 94
While this is a satisfactory solution, I wonder if there's a solution using dplyr or similar that uses nicer logic and saves me from having to convert my actual data (which is a tibble) into matrix form.
The key step is to gather; after that it is all straightforward dplyr:
flow_by_sector <-
df1 %>%
mutate(sector_from = rownames(.)) %>%
tidyr::gather(sector_to, flow, -sector_from)
flow_by_sector_with_agg <-
flow_by_sector %>%
left_join(lookup, by = c("sector_from" = "sector")) %>%
rename(agg_from = aggregate_sector) %>%
left_join(lookup, by = c("sector_to" = "sector")) %>%
rename(agg_to = aggregate_sector)
flow_by_agg <-
flow_by_sector_with_agg %>%
group_by(agg_from, agg_to) %>%
summarise(flow = sum(flow))
tidyr::spread(flow_by_agg, agg_to, flow)
Here's a base answer that uses stack and xtabs. It's not super robust - it assumes that the lookup table has the same columns and order as what would be expressed in the data.frame.
colnames(df1) <- lookup$aggregate_sector
xtabs(values ~ sector + ind
, dat = data.frame(sector = rep(lookup$aggregate_sector
, length(df1)), stack(df1))
)
Here's another way to do the data.frame:
xtabs(values ~ Var1 + Var2,
dat = data.frame(expand.grid(lookup$aggregate_sector, lookup$aggregate_sector)
, values = unlist(df1))
)
Var2
Var1 A B C
A 59 30 65
B 42 39 65
C 67 70 94
I actually figured out a matrix algebra alternative to my problem which is much faster despite having to convert my data.frame into a matrix. I won't accept this solution as I did ask specifically for a dplyr answer, but thought it interesting enough to post here anyway.
I first had to form an adjustment matrix, S, from my lookup table, where the locations of the ones in row i of S indicate which sectors of the original matrix will be grouped together as sector i in the aggregated matrix:
S <- lookup %>% mutate(sector = as.numeric(sector), value = 1) %>%
spread(sector, value) %>%
column_to_rownames('aggregate_sector') %>%
as.matrix()
S[is.na(S)] <- 0
S
# 1 2 3 4 5 6 7 8 9 10
#A 1 1 1 0 0 0 0 0 0 0
#B 0 0 0 1 1 1 0 0 0 0
#C 0 0 0 0 0 0 1 1 1 1
Then, I convert my original data.frame, df1, into matrix x and simply calculate S.x.S' :
x <- as.matrix(df1)
S %*% x %*% t(S)
# A B C
#A 59 30 65
#B 42 39 65
#C 67 70 94
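The same S x S' product can also be sketched without any tidyverse helpers, building the indicator matrix with `outer()`. This is an illustrative variant only: the matrix below is typed in from the printed `df1` above, and the names `groups` and `agg` are hypothetical.

```r
# Flow matrix copied from the printed df1 above (rows = from, columns = to)
x <- matrix(c(
  8,  8,  4, 7, 1,  1,  9,  1, 2,  7,
  8,  7,  3, 2, 7,  7,  1,  8, 4,  5,
  5,  6, 10, 2, 2,  1,  6, 10, 7,  5,
  7,  1,  9, 2, 1,  1,  4,  5, 1,  8,
  7,  3,  9, 7, 9,  5, 10, 10, 3,  2,
  4,  1,  1, 4, 6,  4, 10, 10, 1,  1,
  2,  3,  8, 4, 8, 10,  4,  1, 9,  6,
  4,  2,  4, 2, 7, 10,  2,  6, 4,  8,
  1, 10, 10, 3, 6,  2,  6,  7, 8,  4,
  6,  8,  9, 3, 6,  9,  5, 10, 4, 10
), nrow = 10, byrow = TRUE,
dimnames = list(as.character(1:10), as.character(1:10)))

# Aggregate sector of each of the 10 sectors, in the same order as x
groups <- c(rep("A", 3), rep("B", 3), rep("C", 4))
agg    <- unique(groups)

# S[i, j] is 1 when sector j belongs to aggregate sector i, else 0
S <- 1 * outer(agg, groups, "==")
rownames(S) <- agg

result <- S %*% x %*% t(S)
print(result)
```

As with the original approach, this assumes the lookup order matches the row and column order of the matrix.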
I'd like to have the sum of absolute values of multiple columns with certain characteristics, say their names end in _s.
set.seed(154)
d <- data.frame(a_s = sample(-10:10,6,replace=F),b_s = sample(-5:10,6,replace=F), c = sample(-10:5,6,replace=F))
d$s <- abs(d$a_s)+abs(d$b_s)
where the output is column s below:
a_s b_s c s
4 8 -2 12
10 6 -8 16
-10 -1 1 11
0 2 4 2
5 1 -3 6
8 -5 5 13
I can use d$ss <- rowSums(d[,grepl('_s',colnames(d))]) to sum the values but not the absolute values.
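One possible way to extend that `rowSums()` call, offered as a sketch rather than a definitive answer, is to wrap the selected columns in `abs()` before summing. The data frame below is typed in from the table above instead of being regenerated from the seed, and the pattern `"_s$"` (anchored at the end of the name) is an assumption about what "end in _s" should match.

```r
# Values copied from the table above
d <- data.frame(
  a_s = c(4, 10, -10, 0, 5, 8),
  b_s = c(8, 6, -1, 2, 1, -5),
  c   = c(-2, -8, 1, 4, -3, 5)
)

# Columns whose names end in "_s"
s_cols <- grepl("_s$", names(d))

# abs() works element-wise on an all-numeric data frame, so it can be
# applied before rowSums(); drop = FALSE keeps a data frame even for one column
d$ss <- rowSums(abs(d[, s_cols, drop = FALSE]))

print(d)
```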
I have a data frame containing many different IDs. For every ID there are multiple events, each described by the cumulative time elapsed between events (hours) and the duration of that event (seconds). It looks something like this:
Id <- c(1,1,1,1,1,1,2,2,2,2,2)
cumulative_time<-c(0,3.58,8.88,11.19,21.86,29.54,0,5,14,19,23)
duration<-c(188,124,706,53,669,1506.2,335,349,395,385,175)
test = data.frame(Id,cumulative_time,duration)
> test
Id cumulative_time duration
1 1 0.00 188.0
2 1 3.58 124.0
3 1 8.88 706.0
4 1 11.19 53.0
5 1 21.86 669.0
6 1 29.54 1506.2
7 2 0.00 335.0
8 2 5.00 349.0
9 2 14.00 395.0
10 2 19.00 385.0
11 2 23.00 175.0
I would like to group by the ID and then restructure each group by binning the cumulative time into intervals of, say, 10 hours, summing the durations that occurred within each 10-hour interval. The bins should run from, say, 0 to 30 hours, so there would be 3 bins.
I looked at the cut function and managed to hack something together within a data frame. Even as a new R user I know it isn't pretty:
test_cut = test %>%
mutate(bin_durations = cut(test$cummulative_time,breaks = c(0,10,20,30),labels = c("10","20","30"),include.lowest = TRUE)) %>%
group_by(Id,bin_durations) %>%
mutate(total_duration = sum(duration)) %>%
select(Id,bin_durations,total_duration) %>%
distinct()
which gives the output:
test_cut
Id bin_durations total_duration
1 1 10 1018.0
2 1 20 53.0
3 1 30 2175.2
4 2 10 684.0
5 2 20 780.0
6 2 30 175.0
Ultimately I want the interval window and number of bins to be arbitrary, e.g. a span of 5000 hours binned in 1-hour samples. For this I would use breaks = seq(0,5000,1) for the bins and labels = as.character(seq(1,5000,1)).
This will also be applied to a very large data frame, so computational speed is somewhat important.
A dplyr solution would be great since I am applying the binning per group.
My guess is there is a nice interaction between cut and perhaps split to generate the desired output.
Thanks in advance.
Update
After testing, I find that even my current implementation isn't quite what I'd like as if I say:
n=3
test_cut = test %>%
mutate(bin_durations = cut(test$cumulative_time,breaks=seq(0,30,n),labels = as.character(seq(n,30,n)),include.lowest = TRUE)) %>%
group_by(Id,bin_durations) %>%
mutate(total_duration = sum(duration)) %>%
select(Id,bin_durations,total_duration) %>%
distinct()
I get
test_cut
# A tibble: 11 x 3
# Groups: Id, bin_durations [11]
Id bin_durations total_duration
<dbl> <fct> <dbl>
1 1 3 188
2 1 6 124
3 1 9 706
4 1 12 53
5 1 24 669
6 1 30 1506.
7 2 3 335
8 2 6 349
9 2 15 395
10 2 21 385
11 2 24 175
Where there are no occurrences in a bin, I should just get 0 in the duration column rather than an omitted row.
Thus, it should look like:
test_cut
# A tibble: 20 x 3
# Groups: Id, bin_durations [20]
Id bin_durations total_duration
<dbl> <fct> <dbl>
1 1 3 188
2 1 6 124
3 1 9 706
4 1 12 53
5 1 15 0
6 1 18 0
7 1 21 0
8 1 24 669
9 1 27 0
10 1 30 1506.
11 2 3 335
12 2 6 349
13 2 9 0
14 2 12 0
15 2 15 395
16 2 18 0
17 2 21 385
18 2 24 175
19 2 27 0
20 2 30 0
Here is one idea via integer division (%/%)
library(tidyverse)
test %>%
group_by(Id, grp = cumulative_time %/% 10) %>%
summarise(total_duration = sum(duration))
which gives,
# A tibble: 6 x 3
# Groups: Id [?]
Id grp total_duration
<dbl> <dbl> <dbl>
1 1 0 1018
2 1 1 53
3 1 2 2175.
4 2 0 684
5 2 1 780
6 2 2 175
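One detail worth keeping in mind when swapping `cut()` for integer division: `%/%` produces left-closed bins such as [0, 10), while `cut()` with its default `right = TRUE` produces right-closed bins such as (0, 10]. The short sketch below, using made-up values, shows where a value sitting exactly on a boundary ends up under each approach.

```r
# Values chosen to sit on and around the bin boundary at 10
x <- c(0, 9.99, 10, 10.01)

# Integer division: 10 already belongs to the second bin (index 1)
by_division <- x %/% 10

# cut() with default right = TRUE: 10 still belongs to the first bin
by_cut <- as.integer(cut(x, breaks = c(0, 10, 20), include.lowest = TRUE))

print(data.frame(x, by_division, by_cut))
```

For the example data in the question no value falls exactly on a boundary, so both approaches give the same totals.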
To address your updated issue, we can use complete to add the missing rows. For the same example, binning in 3-hour intervals:
test %>%
group_by(Id, grp = cumulative_time %/% 3) %>%
summarise(total_duration = sum(duration)) %>%
ungroup() %>%
complete(Id, grp = seq(min(grp), max(grp)), fill = list(total_duration = 0))
which gives,
# A tibble: 20 x 3
Id grp total_duration
<dbl> <dbl> <dbl>
1 1 0 188
2 1 1 124
3 1 2 706
4 1 3 53
5 1 4 0
6 1 5 0
7 1 6 0
8 1 7 669
9 1 8 0
10 1 9 1506.
11 2 0 335
12 2 1 349
13 2 2 0
14 2 3 0
15 2 4 395
16 2 5 0
17 2 6 385
18 2 7 175
19 2 8 0
20 2 9 0
We could make these changes:
test$cummulative_time can be written simply as cumulative_time
breaks can be factored out and then used in cut, as shown
the second mutate can be changed to summarize, in which case the select and distinct are not needed
it is always a good idea to close any group_by with a matching ungroup or, in the case of summarize, to use .groups = "drop"
add complete to insert 0 for levels that are not present
Implementing these changes we have:
library(dplyr)
library(tidyr)
breaks <- seq(0, 40, 10)
test %>%
mutate(bin_durations = cut(cumulative_time, breaks = breaks,
labels = breaks[-1], include.lowest = TRUE)) %>%
group_by(Id,bin_durations) %>%
summarize(total_duration = sum(duration), .groups = "drop") %>%
complete(Id, bin_durations, fill = list(total_duration = 0))
giving:
# A tibble: 8 x 3
Id bin_durations total_duration
<dbl> <fct> <dbl>
1 1 10 1018
2 1 20 53
3 1 30 2175.
4 1 40 0
5 2 10 684
6 2 20 780
7 2 30 175
8 2 40 0
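For comparison, a dependency-free sketch of the same idea (bin with `cut()`, sum per group, keep empty bins as 0) is shown below in base R. It is an illustration under assumptions, not a benchmarked solution: `bin_width` and `totals` are hypothetical names, and the 3-hour width and 0 to 30 range are taken from the update in the question.

```r
test <- data.frame(
  Id              = c(1, 1, 1, 1, 1, 1, 2, 2, 2, 2, 2),
  cumulative_time = c(0, 3.58, 8.88, 11.19, 21.86, 29.54, 0, 5, 14, 19, 23),
  duration        = c(188, 124, 706, 53, 669, 1506.2, 335, 349, 395, 385, 175)
)

bin_width <- 3                         # hypothetical name; any width works
breaks <- seq(0, 30, by = bin_width)

# cut() keeps every bin as a factor level, even bins with no observations
test$bin <- cut(test$cumulative_time, breaks = breaks,
                labels = as.character(breaks[-1]), include.lowest = TRUE)

# xtabs() sums duration over the full Id x bin grid, so empty cells become 0
totals <- as.data.frame(xtabs(duration ~ Id + bin, data = test),
                        responseName = "total_duration")
totals <- totals[order(totals$Id, totals$bin), ]

print(totals)
```

Because the result covers the full grid of Id and bin levels, it has one row per Id per bin (20 rows here), with 0 where no event fell into a bin.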
My data contains statistics on the outcome of a soccer game, with 12806 observations (match outcomes) and 34 key performance indicators.
A (small) example of my data.frame is below:
head(Test)
MatchID Outcome Var1 Var2 Var3 Var4 Var5
1 30 Loss 0 10 0 10 0
2 30 Win 6 13 6 13 6
3 31 Loss 8 12 3 6 3
4 31 Win 29 40 9 19 3
5 32 Loss 7 26 7 26 6
6 32 Win 11 20 11 20 9
For every unique "MatchID" I wish to subtract each of the losing team's (Outcome=="Loss") key performance indicators from those of the winning team (Outcome=="Win"). My data set is not always arranged as Loss, Win, Loss, Win, so doing this in a row-wise fashion may not be possible.
I have tried the following using dplyr:
Differences <- Test %>%
group_by(MatchID) %>%
summarise_at( .vars = names(.)[3:7], ((Outcome == "Win") - (Outcome == "Loss")))
but fear I am using the wrong approach as I received the following error: Error in inherits(x, "fun_list") : object 'Outcome' not found
My anticipated outcome would be:
head(AnticipatedOutcome)
MatchID Var1 Var2 Var3 Var4 Var5
1 30 6 3 6 3 6
3 31 21 28 6 13 0
5 32 4 -6 4 -6 3
Is this possible to achieve using dplyr?
The difference of two logical vectors is a vector of the same length, not a summary. Instead, we need to subset each 'Var' column where 'Outcome' is "Win", take its sum, and subtract from it the sum of the values where 'Outcome' is "Loss":
library(tidyverse)
Test %>%
group_by(MatchID) %>%
summarise_at(vars(starts_with('Var')),
funs(sum(.[Outcome == "Win"]) - sum(.[Outcome == "Loss"])))
# A tibble: 3 x 6
# MatchID Var1 Var2 Var3 Var4 Var5
# <int> <int> <int> <int> <int> <int>
#1 30 6 3 6 3 6
#2 31 21 28 6 13 0
#3 32 4 -6 4 -6 3
Or another option would be to gather into 'long' format, get the group by difference of sum and spread it to 'wide' format
Test %>%
gather(key, val, Var1:Var5) %>%
group_by(MatchID, key) %>%
summarise(val = sum(val[Outcome == "Win"]) - sum(val[Outcome == "Loss"])) %>%
spread(key, val)
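The "sum of Win minus sum of Loss" logic can also be sketched in base R by giving each row a sign and summing per match with `rowsum()`. This is only an illustrative alternative: the data frame is typed in from the example above, the rows are deliberately shuffled to show that the order does not matter, and `sgn`, `vars`, and `diffs` are hypothetical names.

```r
Test <- data.frame(
  MatchID = c(30, 30, 31, 31, 32, 32),
  Outcome = c("Loss", "Win", "Loss", "Win", "Loss", "Win"),
  Var1 = c(0, 6, 8, 29, 7, 11),
  Var2 = c(10, 13, 12, 40, 26, 20),
  Var3 = c(0, 6, 3, 9, 7, 11),
  Var4 = c(10, 13, 6, 19, 26, 20),
  Var5 = c(0, 6, 3, 3, 6, 9)
)

# Shuffle the rows to mimic data that is not sorted Loss, Win, Loss, Win
Test <- Test[c(4, 1, 6, 2, 3, 5), ]

# +1 for winning rows, -1 for losing rows
sgn  <- ifelse(Test$Outcome == "Win", 1, -1)
vars <- grep("^Var", names(Test), value = TRUE)

# Multiplying by sgn flips the sign of the Loss rows; rowsum() then adds
# the rows of each match, which yields Win minus Loss per MatchID
diffs <- rowsum(as.matrix(Test[vars]) * sgn, group = Test$MatchID)

print(diffs)
```

The result is a matrix with one row per MatchID (as row names) and one column per indicator.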
One can use data.table with the .SDcols argument to summarise the data. As @akrun has mentioned in his solution, the sum of "Loss" is subtracted from the sum of "Win" for each match.
library(data.table)
setDT(df)
df[,lapply(.SD,function(x)sum(x[Outcome=="Win"]) - sum(x[Outcome=="Loss"])),
.SDcols = Var1:Var5,by=MatchID]
# MatchID Var1 Var2 Var3 Var4 Var5
# 1: 30 6 3 6 3 6
# 2: 31 21 28 6 13 0
# 3: 32 4 -6 4 -6 3
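Note: Just for the sake of exploring different ideas, one can achieve the same result in base R, starting from the plain data.frame and assuming exactly one "Win" and one "Loss" row per match:
# sort once, then index the sorted copy so the rows line up by MatchID
s <- df[order(df$MatchID), ]
cbind(unique(s[1]), s[s$Outcome == "Win", 3:7] -
s[s$Outcome == "Loss", 3:7])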
# MatchID Var1 Var2 Var3 Var4 Var5
# 1 30 6 3 6 3 6
# 3 31 21 28 6 13 0
# 5 32 4 -6 4 -6 3
Data:
df <- read.table(text =
"MatchID Outcome Var1 Var2 Var3 Var4 Var5
1 30 Loss 0 10 0 10 0
2 30 Win 6 13 6 13 6
3 31 Loss 8 12 3 6 3
4 31 Win 29 40 9 19 3
5 32 Loss 7 26 7 26 6
6 32 Win 11 20 11 20 9",
header =TRUE, stringsAsFactors = FALSE)