Aggregating output from multiple input files in R

Right now I have the R code below. It reads in data that looks like this:
track_id day hour month year rate gate_id pres_inter vmax_inter
9 10 0 7 1 9.6451E-06 2 97809 23.545
9 10 0 7 1 9.6451E-06 17 100170 13.843
10 3 6 7 1 9.6451E-06 2 96662 31.568
13 22 12 8 1 9.6451E-06 1 94449 48.466
13 22 12 8 1 9.6451E-06 17 96749 30.55
16 13 0 8 1 9.6451E-06 4 98702 19.205
16 13 0 8 1 9.6451E-06 16 98585 18.143
19 27 6 9 1 9.6451E-06 9 98838 20.053
19 27 6 9 1 9.6451E-06 17 99221 17.677
30 13 12 6 2 9.6451E-06 2 97876 27.687
30 13 12 6 2 9.6451E-06 16 99842 18.163
32 20 18 6 2 9.6451E-06 1 99307 17.527
##################################################################
# Input / Output variables
##################################################################
library(plyr)  # needed for ddply() and count()

for (N in (59:96)){
  # Zero-pad the track number to four characters, e.g. "0059"
  if (N < 10){
    TrackID <- paste("000", N, sep="")
  } else {
    TrackID <- paste("00", N, sep="")
  }
  print(TrackID)
  # For 2010_08_24 trackset
  # fname_in <- paste('input/2010_08_24/intersections_track_calibrated_jma_from1951_',TrackID,'.csv', sep="")
  # fname_out <- paste('output/2010_08_24/tracks_crossing_regional_polygon_',TrackID,'.csv', sep="")
  # For 2012_05_01 trackset
  fname_in <- paste('input/2012_05_01/intersections_track_param_',TrackID,'.csv', sep="")
  fname_out <- paste('output/2012_05_01/tracks_crossing_regional_polygon_',TrackID,'.csv', sep="")
  fname_out2 <- paste('output/2012_05_01/GateID_',TrackID,'.csv', sep="")
  #######################################################################
  # Read the gate-crossing track data
  cat('reading the crosstat output file', fname_in, '\n')
  header <- read.table(fname_in, nrows=1)
  track <- read.table(fname_in, sep=',', skip=1)
  colnames(track) <- c("ID", "day", "month", "year", "hour", "rate", "gate_id", "pres_inter", "vmax_inter")
  # Select the row with the maximum vmax_inter for each storm ID
  ByTrack <- ddply(track, "ID", function(x) x[which.max(x$vmax_inter),])
  # Count the number of crossings per gate
  ByGate <- count(track, vars="gate_id")
  # Write the output file with a single record per storm
  cat('Writing the full output file', fname_out, '\n')
  write.table(ByTrack, fname_out, col.names=TRUE, row.names=FALSE, sep = ',')
  # Write the output file with one frequency record per gate
  cat('Writing the gate frequency file', fname_out2, '\n')
  write.table(ByGate, fname_out2, col.names=TRUE, row.names=FALSE, sep = ',')
}
My output from the final section of code is a file that groups by gate_id and outputs the frequency of occurrence. It looks like this:
gate_id freq
1 935
2 2096
3 1363
4 963
5 167
6 17
7 43
8 62
9 208
10 267
11 64
12 162
13 178
14 632
15 807
16 2003
17 838
18 293
The thing is that I output a file that looks just like this for 96 different input files. Instead of outputting 96 separate files, I'd like to calculate these aggregations per input file, and then sum the frequency across all 96 inputs and print out one SINGLE output file. Can anyone help?
Thanks,
K

You are going to need to do something like the function below. This would grab all the .csv files in one directory, so that directory would have to contain only the files you want to analyze.
myFun <- function(out.file = "mydata") {
  files <- list.files(pattern = "\\.(csv|CSV)$")
  # Use this next line if you are going to use the file name as a variable/output etc.
  files.noext <- substr(basename(files), 1, nchar(basename(files)) - 4)
  for (i in seq_along(files)) {
    temp <- read.csv(files[i], header = FALSE)
    # YOUR CODE HERE
    # Use the code you have already written but operate on files[i] or temp
    # Save the important stuff into one data frame that grows
    # Think carefully ahead of time what structure makes the most sense
  }
  datafile <- paste(out.file, ".csv", sep = "")
  write.csv(yourDataFrame, file = datafile)
}
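To make that skeleton concrete for the gate-frequency case, here is a sketch in base R (the function name is hypothetical, and the column names are taken from the code in the question): each input file gets its own per-gate count, the counts are stacked into one long data frame, and aggregate() sums the frequencies across all inputs into a single table.

```r
combineGateCounts <- function(files, out.file = "GateID_all.csv") {
  # One gate-frequency table per input file, stacked into one long data frame
  all.counts <- do.call(rbind, lapply(files, function(f) {
    track <- read.table(f, sep = ",", skip = 1)
    colnames(track) <- c("ID", "day", "month", "year", "hour", "rate",
                         "gate_id", "pres_inter", "vmax_inter")
    as.data.frame(table(gate_id = track$gate_id), responseName = "freq")
  }))
  # Sum the per-file frequencies for each gate across all inputs
  total <- aggregate(freq ~ gate_id, data = all.counts, FUN = sum)
  write.table(total, out.file, col.names = TRUE, row.names = FALSE, sep = ",")
  invisible(total)
}
```

Calling it with your 96 file names, e.g. `combineGateCounts(list.files(pattern = "^intersections_track_param_.*\\.csv$"))`, would then produce one combined output file instead of 96.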

Related

Writing out a .dat file in R

I have a dataset that looks like this:
ids <- c(111,12,134,14,155,16,17,18,19,20)
scores.1 <- c(0,1,0,1,1,2,0,1,1,1)
scores.2 <- c(0,0,0,1,1,1,1,1,1,0)
data <- data.frame(ids, scores.1, scores.1)
> data
ids scores.1 scores.1.1
1 111 0 0
2 12 1 1
3 134 0 0
4 14 1 1
5 155 1 1
6 16 2 2
7 17 0 0
8 18 1 1
9 19 1 1
10 20 1 1
ids stands for student ids, scores.1 is the response/score for the first question, and scores.2 is the response/score for the second question. Student ids vary in the number of digits, but scores always have 1 digit. I am trying to write this out as a .dat file using the write.fwf() function from the gdata library.
item.count <- dim(data)[2] - 1 # counts the number of questions in the dataset
write.fwf(data, file = "data.dat", width = c(5, rep(1, item.count)),
          colnames = FALSE, sep = "")
I would like to separate the student ids and the question responses with some spaces, so I specified width = c(5, rep(1, item.count)) in write.fwf(). However, the output file ends up with the padding on the left side of the student ids
11100
1211
13400
1411
15511
1622
1700
1811
1911
2011
rather than at the right side of the ids.
111 00
12 11
134 00
14 11
155 11
16 22
17 00
18 11
19 11
20 11
Any recommendations?
Thanks!
We can use unite to unite the 'score' columns into a single one and then use write.csv
library(dplyr)
library(tidyr)
data %>%
unite(scores, starts_with('scores'), sep='')
With #akrun's help, this gives what I wanted (note that the result of unite() has to be assigned back, otherwise write.fwf() still sees the original columns):
library(dplyr)
library(tidyr)
library(gdata)
data <- data %>%
  unite(scores, starts_with('scores'), sep='')
write.fwf(data, file = "data.dat",
          width = c(5, item.count),
          colnames = FALSE, sep = " ")
In the .dat file, the dataset looks like this:
111 00
12 11
134 00
14 11
155 11
16 22
17 00
18 11
19 11
20 11
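For reference, the same layout can be produced in base R without gdata (a sketch, assuming the same toy data and that left-justified ids are acceptable): sprintf() pads the id into a fixed-width field and the scores are appended directly.

```r
ids <- c(111, 12, 134, 14, 155, 16, 17, 18, 19, 20)
scores.1 <- c(0, 1, 0, 1, 1, 2, 0, 1, 1, 1)
scores.2 <- c(0, 0, 0, 1, 1, 1, 1, 1, 1, 0)

# %-5s left-justifies the id in a 5-character field; the scores follow directly
lines <- sprintf("%-5s%d%d", ids, scores.1, scores.2)
writeLines(lines, "data.dat")
```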

Difficulty converting wide format to tidy format in dataset

I am using Kaggle's gun violence dataset. My goal is to use Tableau for an interactive visualization of some of the regions and the specifics of gun crimes there, and to turn this dataframe into tidy format. Link:
https://www.kaggle.com/jameslko/gun-violence-data/version/1
With that being the case, there are a couple of columns formatted in a way that I am having issues wrangling in R. There are around 20 or so columns; these 4 are formatted like this:
A little background: there can be more than one gun involved in a crime, and more than one participant. Due to this, these columns contain information for each gun/participant split by '||'. The 0:, 1: ... indicates details for that specific gun/participant.
My goal is to capture the unique instances in each column and disregard the 0:, 1:, 2:, ...
Here is my code so far:
df= read.csv("C:/Users/rmahesh/Desktop/gun-violence-data_01-2013_03-2018.csv")
df$incident_id = NULL
df$incident_url = NULL
df$source_url = NULL
df$participant_name = NULL
df$participant_relationship = NULL
df$sources = NULL
df$incident_url_fields_missing = NULL
df$participant_status = NULL
df$participant_age_group = NULL
df$participant_type = NULL
df$incident_characteristics = NULL
#Subset of columns with formatting issues:
df2 = df[, c('gun_stolen', 'gun_type', 'participant_age', 'participant_gender')]
I have yet to run into an issue like this, and would love any help figuring out how to solve my problem. Any help would be greatly appreciated!
Edit 1: I have recreated the first 3 rows of the columns in question. The format is more or less identical, with some values missing at times:
gun_stolen,gun_type,participant_age,participant_gender
0::Unknown||1::Unknown, 0::Unknown||1::Unknown, 0::25||1::31||2::33||3::34||4::33, 0::Male||1::Male||2::Male||3::Male||4::Male
0::Unknown||1::Unknown,0::22 LR||1::223 Rem [AR-15],0::51||1::40||2::9||3::5||4::2||5::15,0::Male||1::Female||2::Male||3::Female||4::Female||5::Male
0::Unknown,0::Shotgun,3::78||4::48,0::Male||1::Male||2::Male||3::Male||4::Male
As Frank said in the comments, "tidy" can mean different things. Here we turn all specified columns in just two: one with the original column name ("key"), the other with the individual values after splitting the strings and removing the prefixes, one row for each ("value").
library(tidyr)
library(dplyr)
library(stringr)
myvars <- c('gun_stolen', 'gun_type', 'participant_age', 'participant_gender')
res <- as_tibble(df2) %>%
tibble::rowid_to_column() %>%
# Split strings in selected columns at "||". This turns those columns in
# list-columns of character vectors
mutate_at(myvars, str_split, pattern = fixed("||")) %>%
# Go from wide to long format: in the new 'key' column are the original column
# names, and 'value' is the one list-column of character vectors
gather(key, value, one_of(myvars)) %>%
# unnest turns the 'value' list-column into a regular character column, with
# duplication of rows that contain a 'value' of length greater than 1
unnest(value) %>%
filter(value != "") %>%
# Remove the "x::" prefixes
mutate(value = str_split_fixed(value, fixed("::"), n = 2)[, 2]) %>%
# Deduplicate
distinct() %>%
arrange(rowid, key, value)
# # A tibble: 732,017 x 3
# rowid key value
# <int> <chr> <chr>
# 1 1 participant_age 20
# 2 1 participant_gender Female
# 3 1 participant_gender Male
# 4 2 participant_age 20
# 5 2 participant_gender Male
# 6 3 gun_stolen Unknown
# 7 3 gun_type Unknown
# 8 3 participant_age 25
# 9 3 participant_age 31
# 10 3 participant_age 33
# # ... with 732,007 more rows
Also expanding on #Ben G's comment:
res %>%
count(key, value) %>%
arrange(key, desc(n))
# # A tibble: 141 x 3
# key value n
# <chr> <chr> <int>
# 1 gun_stolen Unknown 132099
# 2 gun_stolen Stolen 7350
# 3 gun_stolen Not-stolen 1560
# 4 gun_stolen "" 355
# 5 gun_type Unknown 98892
# 6 gun_type Handgun 17609
# 7 gun_type 9mm 6040
# 8 gun_type Shotgun 3560
# 9 gun_type Rifle 3196
# 10 gun_type 22 LR 3093
# 11 gun_type 40 SW 2624
# 12 gun_type 380 Auto 2323
# 13 gun_type 45 Auto 2234
# 14 gun_type 38 Spl 1758
# 15 gun_type 223 Rem [AR-15] 1248
# 16 gun_type 12 gauge 975
# 17 gun_type Other 892
# 18 gun_type 7.62 [AK-47] 854
# 19 gun_type 357 Mag 800
# 20 gun_type 25 Auto 601
# 21 gun_type 32 Auto 481
# 22 gun_type "" 356
# 23 gun_type 20 gauge 194
# 24 gun_type 44 Mag 192
# 25 gun_type 30-30 Win 105
# 26 gun_type 410 gauge 96
# 27 gun_type 308 Win 88
# 28 gun_type 30-06 Spr 71
# 29 gun_type 10mm 50
# 30 gun_type 16 gauge 30
# 31 gun_type 300 Win 23
# 32 gun_type 28 gauge 6
# 33 participant_age 19 10541
# 34 participant_age 20 9919
# 35 participant_age 18 9826
# 36 participant_age 21 9795
# 37 participant_age 22 9642
# 38 participant_age 23 9383
# 39 participant_age 24 9204
# 40 participant_age 25 8562
# 41 participant_age 26 7815
# 42 participant_age 17 7416
# 43 participant_age 27 7228
# 44 participant_age 28 6528
# 45 participant_age 29 6055
# 46 participant_age 30 5652
# 47 participant_age 31 5145
# 48 participant_age 32 5039
# 49 participant_age 16 4977
# 50 participant_age 33 4662
# # ... with 91 more rows
I think by tidying you mean splitting the contents of the delimited columns and separating them into rows. You can either take just the first element or give each element its own row.
df<-data.frame(instance=1:5,
gun_type=c("", "0::Unknown||1::Unknown", "",
"0::Handgun||1::Handgun", ""), stringsAsFactors=FALSE)
df$first<-sapply(strsplit(df$gun_type, "\\|\\|"), '[', 1)
splitType<-strsplit(df$gun_type, "\\|\\|")
df.2<-df[rep(1:nrow(df), sapply(splitType, length)),]
df.2$splitType<-unlist(splitType)
If you want just the unique values then use:
splitTypeUnique<-sapply(splitType, unique)
df.2<-df[rep(1:nrow(df), sapply(splitTypeUnique, length)),]
df.2$splitType<-unlist(splitTypeUnique)
but you will have to do a little wrangling to get the unique part to work
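One way to do that wrangling (a sketch on the same toy df): strip the numeric prefixes before calling unique(), and use lapply() plus lengths() so the result stays a list even when every row happens to have the same number of unique values.

```r
df <- data.frame(instance = 1:5,
                 gun_type = c("", "0::Unknown||1::Unknown", "",
                              "0::Handgun||1::Handgun", ""),
                 stringsAsFactors = FALSE)
splitType <- strsplit(df$gun_type, "||", fixed = TRUE)
# Strip the "0::", "1::", ... prefixes first, then deduplicate per row
stripped <- lapply(splitType, function(x) unique(sub("^\\d+::", "", x)))
# Empty strings split to character(0), so lengths() drops those rows entirely
df.2 <- df[rep(seq_len(nrow(df)), lengths(stripped)), ]
df.2$splitType <- unlist(stripped)
```

With prefixes removed before unique(), "0::Unknown||1::Unknown" correctly collapses to a single "Unknown" row.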

Import in R with column headers across 3 rows. Replace missing with latest non-missing column

I need help importing data where my column header is split across 3 rows, with some header names implied. Here is what my xlsx file looks like
1 USA China
2 Dollars Volume Dollars Volume
3 Category Brand CY2016 CY2017 CY2016 CY2017 CY2016 CY_2017 CY2016 CY2017
4 Chocolate Snickers 100 120 15 18 100 80 20 22
5 Chocolate Twix 70 80 8 10 75 50 55 20
I would like to import the data into R, except I would like to retain the headers in rows 1 & 2. An added challenge is that some headers are implied. If a header cell is blank, I would like it to use the nearest filled cell to its left. An example of what I'd like it to import as:
1 Category Brand USA_Dollars_CY2016 USA_Dollars_CY2017 USA_Volume_CY2016 USA_Volume_CY2017 China_Dollars_CY2016 China_Dollars_CY_2017 China_Volume_CY2016 China_Volume_CY2017
2 Chocolate Snickers 100 120 15 18 100 80 20 22
3 Chocolate Twix 70 80 8 10 75 50 55 20
My current method is to import the data, skipping rows 1 & 2, and then rename the columns based on known position. However, I was hoping there is code that would let me avoid this manual step. Thank you!!
I will assume that you have saved the xlsx data in .csv format, so it can be read in like this:
header <- read.csv("data.csv", header=F, colClasses="character", nrow=3)
dat <- read.csv("data.csv", header=F, skip=3)
The tricky part is the header. This function should do it:
construct_colnames <- function(header) {
  f <- function(x) {
    x <- as.character(x)
    # Fill blanks forward: each empty cell takes the most recent non-empty value
    c("", x[!is.na(x) & x != ""])[cumsum(!is.na(x) & x != "") + 1]
  }
  res <- apply(header, 1, f)
  res <- apply(res, 1, paste0, collapse = "_")
  sub("^_*", "", res)
}
colnames(dat) <- construct_colnames(header)
dat
Result:
Category Brand USA_Dollars_CY2016 USA_Dollars_CY2017 USA_Volume_CY2016 USA_Volume_CY2017 China_Dollars_CY2016
1 Chocolate Snickers 100 120 15 18 100
2 Chocolate Twix 70 80 8 10 75
China_Dollars_CY_2017 China_Volume_CY2016 China_Volume_CY2017
1 80 20 22
2 50 55 20

Plot a decision tree with R

I have a 440*2 matrix that looks like:
1 144
1 152
1 135
2 3
2 12
2 107
2 31
3 4
3 147
3 0
4 end
4 0
4 0
5 6
5 7
5 10
5 9
The left column holds the starting points, e.g. in the app all the 1's on the left would be on the same page. They lead to three choices: pages 144, 152 and 135. These pages can each lead to another page, and so on, until the right-hand column says 'end'. What I would like is a way to visualise the scale of this tree. I realise it will be quite large given the number of rows, so maybe not graph-friendly; so for clarity I want to know how many possible routes there are in total (from every start point, down every option, to each end destination; I realise there will be overlaps, which is why I am finding this hard to calculate).
Secondly, each number has an associated title. I would like a function whereby, if you input a given title, it plots all the possible starting points and the paths that lead there. This should be a lot smaller and therefore graph-friendly.
e.g.
dta <- "
14 12 as
186 187 Frac
187 154 Low
23 52 Med
52 11 Lip
15 55 asd
11 42 AAA
42 154 BBB
154 end Coll"
Edited example data to show that some branches are not connected to desired tree
dta <- "
14 12 as
186 187 Frac
187 154 Low
23 52 Med
52 11 Lip
11 42 AAA
42 154 BBB
154 end Coll"
dta <- gsub(" ", ",", dta, fixed = TRUE)
df <- read.csv(textConnection(dta), stringsAsFactors = FALSE, header = FALSE)
names(df) <- c("from", "to", "nme")
library(data.tree)
Warning message:
package ‘data.tree’ was built under R version 3.2.5
tree <- FromDataFrameNetwork(df)
**Error in FromDataFrameNetwork(df) :**
**Cannot find root name. network is not a tree!**
I made this example to show how column 1 leads to a value in column 2, which then refers back to a value in column 1, until you reach the end. Different starting points can lead to different-length paths to the same destination, so this would look something like:
So here, I wanted to see how you could go from all start points to 'Coll'
greatly appreciate any help
If you indeed have a tree (i.e. no cycles), you can use data.tree:
Start by converting to a data.frame:
dta <- "
14 12 as
186 187 Frac
187 154 Low
23 52 Med
52 11 Lip
15 55 asd
11 42 AAA
42 154 BBB
154 end Coll
55 end efg
12 end hij"
dta <- gsub(" ", ",", dta, fixed = TRUE)
df <- read.csv(textConnection(dta), stringsAsFactors = FALSE, header = FALSE)
names(df) <- c("from", "to", "nme")
Now, convert to a data.tree:
library(data.tree)
tree <- FromDataFrameNetwork(df)
tree$leafCount
You can now navigate to any sub-tree, for analysis and plotting. E.g. using any of the following possibilities:
subTree <- tree$FindNode(187)
subTree <- Climb(tree, nme = "Coll", nme = "Low")
subTree <- tree$`154`$`187`
subTree <- Clone(tree$`154`)
Maybe printing is all you need:
print(subTree , "nme")
This will print like so:
levelName nme
1 154 Coll
2 ¦--187 Low
3 ¦ °--186 Frac
4 °--42 BBB
5 °--11 AAA
6 °--52 Lip
7 °--23 Med
Otherwise, use fancy plotting:
SetNodeStyle(subTree , style = "filled,rounded", shape = "box", fontname = "helvetica", label = function(node) node$nme, tooltip = "name")
plot(subTree , direction = "descend")
This looks like this:
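The first part of the question, counting how many distinct routes exist in total, is not covered by data.tree above. As a sketch, igraph can enumerate all simple start-to-end paths (this assumes the same edge list as the example above and that the graph contains no cycles):

```r
library(igraph)

edges <- data.frame(
  from = c("14", "186", "187", "23", "52", "15", "11", "42", "154", "55", "12"),
  to   = c("12", "187", "154", "52", "11", "55", "42", "154", "end", "end", "end"),
  stringsAsFactors = FALSE)
g <- graph_from_data_frame(edges, directed = TRUE)
# Starting points are nodes that never appear as a destination
starts <- setdiff(edges$from, edges$to)
# Enumerate every simple path from each start point down to "end"
paths <- unlist(lapply(starts, function(s)
  all_simple_paths(g, from = s, to = "end")), recursive = FALSE)
length(paths)  # total number of distinct start-to-end routes
```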

CSV conversion in R for standard calculations

I have a problem calculating the mean of columns for a dataset imported from this CSV file
I import the file using the following command:
dataGSR = read.csv("ShimmerData.csv", header = TRUE, sep = ",",stringsAsFactors=T)
dataGSR$X=NULL #don't need this column
Then I take a subset of this:
dati=dataGSR[4:1000,]
I check that they are correct:
head(dati)
Shimmer Shimmer.1 Shimmer.2 Shimmer.3 Shimmer.4 Shimmer.5 Shimmer.6 Shimmer.7
4 31329 0 713 623.674691281028 2545 3706.5641025641 2409 3529.67032967033
5 31649 9.765625 713 623.674691281028 2526 3678.89230769231 2501 3664.46886446886
6 31969 19.53125 712 638.528829576655 2528 3681.80512820513 2501 3664.46886446886
7 32289 29.296875 713 623.674691281028 2516 3664.3282051282 2498 3660.07326007326
8 32609 39.0625 711 654.10779696494 2503 3645.39487179487 2496 3657.14285714286
9 32929 48.828125 713 623.674691281028 2505 3648.30769230769 2496 3657.14285714286
When I type
means=colMeans(dati)
I get this error:
Error in colMeans(dati) : 'x' must be numeric
To try to solve this problem I convert everything into a matrix:
datiM=data.matrix(dati)
But when I check the new variable, the data values are different:
head(datiM)
Shimmer Shimmer.1 Shimmer.2 Shimmer.3 Shimmer.4 Shimmer.5 Shimmer.6 Shimmer.7
4 370 1 10 1 65 65 1 1
5 375 3707 10 1 46 46 24 24
6 381 1025 9 2 48 48 24 24
7 386 2162 10 1 36 36 21 21
8 392 3126 8 3 23 23 19 19
9 397 3229 10 1 25 25 19 19
My question here is:
How do I convert the "dati" variable correctly so that colMeans() works?
In addition to #akrun's advice, another option is to convert the columns to numeric yourself (rather than having read.csv do it):
dati <- data.frame(
lapply(dataGSR[-c(1:3),-9],as.numeric))
##
R> colMeans(dati)
Shimmer Shimmer.1 Shimmer.2 Shimmer.3 Shimmer.4 Shimmer.5 Shimmer.6 Shimmer.7
33004.2924 18647.4609 707.4335 718.3989 2521.3626 3672.1383 2497.9013 3659.9287
Where dataGSR was read in with stringsAsFactors=F,
dataGSR <- read.csv(
file="F:/temp/ShimmerData.csv",
header=TRUE,
stringsAsFactors=F)
Unless you know for sure that you need character columns to be factors, you are better off setting this option to FALSE.
The header lines ("character") in the dataset span the first 4 lines. We can skip those lines with skip=4 and header=FALSE, then set the column names based on the info from the first 4 lines.
dataGSR <- read.csv('ShimmerData.csv', header=FALSE,
stringsAsFactors=FALSE, skip=4)
lines <- readLines('ShimmerData.csv', n=4)
colnames(dataGSR) <- do.call(paste, c(strsplit(lines, ','),
list(sep="_")))
dataGSR <- dataGSR[,-9]
unname(colMeans(dataGSR))
# [1] 33004.2924 18647.4609 707.4335 718.3989 2521.3626
# 3672.1383 2497.9013
# [8] 3659.9287
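A third option (a sketch using a toy stand-in for the character-typed subset; the real dati comes from read.csv as in the question): utils::type.convert() re-infers each column's type, so you don't have to call as.numeric() column by column.

```r
# Toy stand-in: two columns stored as character, as after reading with header rows
dati <- data.frame(a = c("1.5", "2.5"), b = c("10", "20"),
                   stringsAsFactors = FALSE)
# Re-infer each column's type; character columns of numbers become numeric
dati[] <- lapply(dati, type.convert, as.is = TRUE)
colMeans(dati)
```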
