writing out .dat file in R

I have a dataset that looks like this:
ids <- c(111,12,134,14,155,16,17,18,19,20)
scores.1 <- c(0,1,0,1,1,2,0,1,1,1)
scores.2 <- c(0,0,0,1,1,1,1,1,1,0)
data <- data.frame(ids, scores.1, scores.2)
> data
   ids scores.1 scores.2
1  111        0        0
2   12        1        0
3  134        0        0
4   14        1        1
5  155        1        1
6   16        2        1
7   17        0        1
8   18        1        1
9   19        1        1
10  20        1        0
ids stands for student ids, scores.1 is the response/score for the first question, and scores.2 is the response/score for the second question. Student ids vary in the number of digits, but scores always have one digit. I am trying to write this out as a .dat file by generating a few helper objects and passing them to the write.fwf() function from the gdata library.
library(gdata)
item.count <- dim(data)[2] - 1  # counts the number of questions in the dataset
write.fwf(data, file = "data.dat", width = c(5, rep(1, item.count)),
          colnames = FALSE, sep = "")
I would like to separate the student ids and the question responses with some space, so I reserved 5 characters for the student ids by specifying width = c(5, rep(1, item.count)) in write.fwf(). However, the output file ends up with the padding spaces on the left side of the student ids,
  11100
   1210
  13400
   1411
  15511
   1621
   1701
   1811
   1911
   2010
rather than on the right side of the ids, like this:
111 00
12 10
134 00
14 11
155 11
16 21
17 01
18 11
19 11
20 10
Any recommendations?
Thanks!

We can use unite to combine the 'scores' columns into a single column and then write the result out:
library(dplyr)
library(tidyr)
data <- data %>%
  unite(scores, starts_with('scores'), sep = '')
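After unite(), the data frame has a single two-character scores column (shown here for the sample data above):
   ids scores
1  111     00
2   12     10
3  134     00
4   14     11
5  155     11
6   16     21
7   17     01
8   18     11
9   19     11
10  20     10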

With #akrun's help, this gives what I wanted:
library(gdata)
library(dplyr)
library(tidyr)
data <- data %>%
  unite(scores, starts_with('scores'), sep = '')
write.fwf(data, file = "data.dat",
          width = c(5, item.count),
          colnames = FALSE, sep = " ")
In the .dat file, the dataset now looks like this:
111 00
12 10
134 00
14 11
155 11
16 21
17 01
18 11
19 11
20 10
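An alternative sketch that avoids unite(): write.fwf()'s justify argument applies to character columns only, so converting the ids to character first should left-align them within the 5-character field (this is an assumption based on gdata's documented behavior, not something tested here):
library(gdata)
# justify only affects character columns, so convert ids first
data$ids <- as.character(data$ids)
write.fwf(data, file = "data.dat", width = c(5, rep(1, item.count)),
          justify = "left", colnames = FALSE, sep = "")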

Related

How to change column names for mrset in R?

I am trying to create crosstabs. I have a data frame that contains several multiple-select questions, imported from an SPSS file using the foreign and expss packages. I create the multiple-select questions using the mrset function. Here's a demo of the code to make it clear.
Banner1 <- w %>%
  tab_cells(mrset(as.category(temp1, counted_value = "Checked"))) %>%
  tab_cols(total(), mrset(as.category(temp2, counted_value = "Checked"))) %>%
  tab_stat_cases(total_row_position = "none", label = "")
tab_pivot(Banner1)
The imported data table looks like this:
             Total  Q12_1  Q12_2  Q12_3  Q12_4  Q12_5
             A      B      C      D      E      F
Total Cases  803    34     18     14     38     37
Q13_1        64     11     7      8      9      7
Q13_2        12     54     54     43     13     12
Q13_3        67     54     23     21     6      4
So this is the imported dataset.
Coming to the problem: as you can see, this dataset has question numbers as column labels rather than variable labels. For single-select questions everything works fine. Is there any function I can use to change the column names for mrset functions dynamically?
The desired output should be something like this, for example:
             Total  Apple  Mango  Banana  Orange  Grapes
             A      B      C      D       E       F
Total Cases  803    34     18     14      38      37
Apple        64     11     7      8       9       7
Mango        12     54     54     43      13      12
Banana       67     54     23     21      6       4
Any help would be greatly appreciated.
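One approach worth trying (a sketch only, not verified against this data): expss generally displays variable labels where they are set, so labelling the constituent columns before building the banner may produce the desired headers. The label values below are assumptions standing in for the real category names:
library(expss)
# label the multiple-select columns; mrset()/as.category() should then
# pick these labels up when the table is built
w <- apply_labels(w,
  Q12_1 = "Apple",
  Q12_2 = "Mango",
  Q12_3 = "Banana",
  Q12_4 = "Orange",
  Q12_5 = "Grapes"
)
# then rebuild Banner1 exactly as above; the Q13_* row variables can be
# labelled the same way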

tidyr::gather can't find my value variables

I have an extremely large data.frame, of which I reproduce a part here.
RECORDING_SESSION_LABEL condition TRIAL_INDEX IA_LABEL IA_DWELL_TIME
1 23 match 1 eyes 3580
2 23 match 1 nose 2410
3 23 match 1 mouth 1442
4 23 match 1 face 841
5 23 mismatch 3 eyes 1817
6 23 mismatch 3 nose 1724
7 23 mismatch 3 mouth 1600
8 23 mismatch 3 face 1136
9 23 mismatch 4 eyes 4812
10 23 mismatch 4 nose 3710
11 23 mismatch 4 mouth 4684
12 23 mismatch 4 face 1557
13 23 mismatch 6 eyes 4645
14 23 mismatch 6 nose 2321
15 23 mismatch 6 mouth 674
16 23 mismatch 6 face 684
17 23 match 7 eyes 1062
18 23 match 7 nose 1359
19 23 match 7 mouth 215
20 23 match 7 face 0
I need to calculate the percentage of IA_DWELL_TIME for each IA_LABEL in each trial index. For that, I first spread IA_LABEL into separate columns:
data_IA_DWELL_TIME <- tidyr::spread(data_IA_DWELL_TIME, key = IA_LABEL, value = IA_DWELL_TIME)
To calculate the percentages, I create a new data frame:
data_IA_DWELL_TIME_percentage <- data_IA_DWELL_TIME
data_IA_DWELL_TIME_percentage$eyes <- 100*(data_IA_DWELL_TIME$eyes/(rowSums(data_IA_DWELL_TIME[,c("eyes","nose","mouth","face")])))
data_IA_DWELL_TIME_percentage$nose <- 100*(data_IA_DWELL_TIME$nose/(rowSums(data_IA_DWELL_TIME[,c("eyes","nose","mouth","face")])))
data_IA_DWELL_TIME_percentage$mouth <- 100*(data_IA_DWELL_TIME$mouth/(rowSums(data_IA_DWELL_TIME[,c("eyes","nose","mouth","face")])))
data_IA_DWELL_TIME_percentage$face <- 100*(data_IA_DWELL_TIME$face/(rowSums(data_IA_DWELL_TIME[,c("eyes","nose","mouth","face")])))
So far all is fine, and I get the wanted output. The problem comes when I want to gather the columns back into rows:
data_IA_DWELL_TIME_percentage <- tidyr::gather(key = IA_LABEL, value = IA_DWELL_TIME,-RECORDING_SESSION_LABEL,-condition, -TRIAL_INDEX)
I obtain this error:
Error in tidyr::gather(key = IA_LABEL, value = IA_DWELL_TIME,
-RECORDING_SESSION_LABEL, : object 'RECORDING_SESSION_LABEL' not found
Any idea of what is going on here? Thanks!
As explained, you're not referring to your data frame in the gather statement.
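The minimal fix is to pass the data frame as gather()'s first argument:
data_IA_DWELL_TIME_percentage <- tidyr::gather(
  data_IA_DWELL_TIME_percentage,
  key = IA_LABEL, value = IA_DWELL_TIME,
  -RECORDING_SESSION_LABEL, -condition, -TRIAL_INDEX
)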
However, you could avoid the need for referring to it altogether and put the second part in a dplyr pipeline, like below:
library(dplyr)
library(tidyr)

data_IA_DWELL_TIME <- spread(data_IA_DWELL_TIME, key = IA_LABEL, value = IA_DWELL_TIME)

data_IA_DWELL_TIME %>%
  mutate_at(
    vars(eyes, nose, mouth, face),
    ~ 100 * (. / rowSums(data_IA_DWELL_TIME[, c("eyes", "nose", "mouth", "face")]))
  ) %>%
  gather(key = IA_LABEL, value = IA_DWELL_TIME,
         -RECORDING_SESSION_LABEL, -condition, -TRIAL_INDEX)

Cumulative count of values in R

I hope you are doing very well. I would like to know how to calculate a cumulative count over a data set under certain conditions. A simplified version of my data set looks like this:
t id
A 22
A 22
R 22
A 41
A 98
A 98
A 98
R 98
A 46
A 46
R 46
A 46
A 46
A 46
R 46
A 46
A 12
R 54
A 66
R 13
A 13
A 13
A 13
A 13
R 13
A 13
I would like to make a new data set where, for each value of id, I have the cumulative number of times that id appears, but whenever t = R the counting has to restart, e.g.:
t id count
A 22 1
A 22 2
R 22 0
A 41 1
A 98 1
A 98 2
A 98 3
R 98 0
A 46 1
A 46 2
R 46 0
A 46 1
A 46 2
A 46 3
R 46 0
A 46 1
A 12 1
R 54 0
A 66 1
R 13 0
A 13 1
A 13 2
A 13 3
A 13 4
R 13 0
A 13 1
Any ideas as to how to do this? Thanks in advance.
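For reference, here is the sample above as a reproducible data.frame (named df, which the answers below assume):
df <- data.frame(
  t  = c("A","A","R","A","A","A","A","R","A","A","R","A","A",
         "A","R","A","A","R","A","R","A","A","A","A","R","A"),
  id = c(22,22,22,41,98,98,98,98,46,46,46,46,46,
         46,46,46,12,54,66,13,13,13,13,13,13,13)
)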
Using rle:
out <- transform(df, count = sequence(rle(do.call(paste, df))$lengths))
out$count[out$t == "R"] <- 0
If your data.frame has more than these two columns and you want to check only these two, then just replace df with df[, 1:2] (or) df[, c("t", "id")].
If you find do.call(paste, df) dangerous (as #flodel comments), then you can replace that with:
as.character(interaction(df))
I personally don't find anything dangerous or clumsy with this setup (as long as you have the right separator, meaning you know your data well). However, if you do find it as such, the second solution may help you.
Update:
For those who don't like using do.call(paste, df) or as.character(interaction(df)) (please see the comment exchanges between me, #flodel and #HongOoi), here's another base solution:
idx <- which(df$t == "R")
ww <- NULL
if (length(idx) > 0) {
  ww <- c(min(idx), diff(idx), nrow(df) - max(idx))
  df <- transform(df, count = ave(id, rep(seq_along(ww), ww),
                                  FUN = function(y) sequence(rle(y)$lengths)))
  df$count[idx] <- 0
} else {
  df$count <- seq_len(nrow(df))
}

aggregating output from multiple input files in R

Right now I have the R code below. It reads in data that looks like this:
track_id day hour month year rate gate_id pres_inter vmax_inter
9 10 0 7 1 9.6451E-06 2 97809 23.545
9 10 0 7 1 9.6451E-06 17 100170 13.843
10 3 6 7 1 9.6451E-06 2 96662 31.568
13 22 12 8 1 9.6451E-06 1 94449 48.466
13 22 12 8 1 9.6451E-06 17 96749 30.55
16 13 0 8 1 9.6451E-06 4 98702 19.205
16 13 0 8 1 9.6451E-06 16 98585 18.143
19 27 6 9 1 9.6451E-06 9 98838 20.053
19 27 6 9 1 9.6451E-06 17 99221 17.677
30 13 12 6 2 9.6451E-06 2 97876 27.687
30 13 12 6 2 9.6451E-06 16 99842 18.163
32 20 18 6 2 9.6451E-06 1 99307 17.527
library(plyr)

##################################################################
# Input / Output variables
##################################################################
for (N in 59:96) {
  if (N < 10) {
    TrackID <- paste("000", N, sep = "")
  } else {
    TrackID <- paste("00", N, sep = "")
  }
  print(TrackID)
  # For the 2010_08_24 trackset:
  # fname_in <- paste('input/2010_08_24/intersections_track_calibrated_jma_from1951_', TrackID, '.csv', sep = "")
  # fname_out <- paste('output/2010_08_24/tracks_crossing_regional_polygon_', TrackID, '.csv', sep = "")
  # For the 2012_05_01 trackset:
  fname_in <- paste('input/2012_05_01/intersections_track_param_', TrackID, '.csv', sep = "")
  fname_out <- paste('output/2012_05_01/tracks_crossing_regional_polygon_', TrackID, '.csv', sep = "")
  fname_out2 <- paste('output/2012_05_01/GateID_', TrackID, '.csv', sep = "")
  #######################################################################
  # Read the gate-crossing track data
  cat('reading the crosstat output file', fname_in, '\n')
  header <- read.table(fname_in, nrows = 1)
  track <- read.table(fname_in, sep = ',', skip = 1)
  colnames(track) <- c("ID", "day", "month", "year", "hour", "rate",
                       "gate_id", "pres_inter", "vmax_inter")
  # Select the maximum surge by storm ID
  ByTrack <- ddply(track, "ID", function(x) x[which.max(x$vmax_inter), ])
  ByGate <- count(track, vars = "gate_id")
  # Write the output file with a single record per storm
  cat('Writing the full output file', fname_out, '\n')
  write.table(ByTrack, fname_out, col.names = TRUE, row.names = FALSE, sep = ',')
  # Write the per-gate frequency file
  cat('Writing the full output file', fname_out2, '\n')
  write.table(ByGate, fname_out2, col.names = TRUE, row.names = FALSE, sep = ',')
}
My output for the final section of code is a file that groups by gate_id and reports the frequency of occurrence. It looks like this:
gate_id freq
1 935
2 2096
3 1363
4 963
5 167
6 17
7 43
8 62
9 208
10 267
11 64
12 162
13 178
14 632
15 807
16 2003
17 838
18 293
The thing is that I output a file that looks just like this for each of the 96 input files. Instead of writing 96 separate files, I'd like to compute these aggregations per input file, then sum the frequencies across all 96 inputs and print out one SINGLE output file. Can anyone help?
Thanks,
K
You are going to need something like the function below. This grabs all the .csv files in one directory, so that directory has to contain only the files you want to analyze.
myFun <- function(out.file = "mydata") {
  files <- list.files(pattern = "\\.(csv|CSV)$")
  # Use this next line if you are going to use the file name as a variable/output etc.
  files.noext <- substr(basename(files), 1, nchar(basename(files)) - 4)
  for (i in seq_along(files)) {
    temp <- read.csv(files[i], header = FALSE)
    # YOUR CODE HERE
    # Use the code you have already written, but operate on files[i] or temp.
    # Save the important stuff into one data frame that grows.
    # Think carefully ahead of time about what structure makes the most sense.
  }
  datafile <- paste(out.file, ".csv", sep = "")
  write.csv(yourDataFrame, file = datafile)
}
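To make that concrete for this case, here is a minimal sketch of the aggregation step, reusing the reading code from the question; gate counts from every file are stacked and then summed per gate_id. The summed-output file name (GateID_totals.csv) is an assumption:
library(plyr)

all.counts <- NULL
for (N in 59:96) {
  TrackID <- sprintf("%04d", N)  # zero-pads, replacing the if/else above
  fname_in <- paste('input/2012_05_01/intersections_track_param_', TrackID, '.csv', sep = "")
  track <- read.table(fname_in, sep = ',', skip = 1)
  colnames(track) <- c("ID", "day", "month", "year", "hour", "rate",
                       "gate_id", "pres_inter", "vmax_inter")
  # stack the per-file gate counts instead of writing them out
  all.counts <- rbind(all.counts, count(track, vars = "gate_id"))
}
# sum the per-file frequencies into one table and write a single file
totals <- ddply(all.counts, "gate_id", summarise, freq = sum(freq))
write.table(totals, 'output/2012_05_01/GateID_totals.csv',
            col.names = TRUE, row.names = FALSE, sep = ',')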

R efficiently add up tables in different order

At some point in my code, I get a list of tables that looks much like this:
[[1]]
cluster_size start end number p_value
13 2 12 13 131 4.209645e-233
12 1 12 12 100 6.166824e-185
22 11 12 22 132 6.916323e-143
23 12 12 23 133 1.176194e-139
13 1 13 13 31 3.464284e-38
13 68 13 117 34 3.275941e-37
23 78 23 117 2 4.503111e-32
....
[[2]]
cluster_size start end number p_value
13 2 12 13 131 4.209645e-233
12 1 12 12 100 6.166824e-185
22 11 12 22 132 6.916323e-143
23 12 12 23 133 1.176194e-139
13 1 13 13 31 3.464284e-38
....
While I don't show the full tables here, I know they are all the same size. What I want to do is make one table in which I add up the p-values. The problem is that the cluster_size, start, end and number columns don't necessarily correspond to the same row across the different list elements, so I can't just do a simple sum.
The brute-force way to do this is to: 1) make a blank table, 2) copy in the appropriate cluster_size, start, end and number columns from the first table, and 3) pull the matching p-values from all the tables using which() statements. Is there a more clever way of doing this, or is this pretty much it?
Edit: I was asked for a dput file of the data. It's located here:
http://alrig.com/code/
In the sample case, the order of the rows happen to match. That will not always be the case.
Seems like you can do this in two steps:
1. Convert your list to a data.frame.
2. Use any of the split-apply-combine approaches to summarize.
Assuming your data was named X, here's what you could do:
library(plyr)
# need to convert to data.frame since all of the list objects are of class matrix
XDF <- as.data.frame(do.call("rbind", X))
ddply(XDF, .(cluster_size, start, end, number), summarize, sump = sum(p_value))
#-----
cluster_size start end number sump
1 1 12 12 100 5.550142e-184
2 1 13 13 31 3.117856e-37
3 1 22 22 1 9.000000e+00
...
29 105 23 117 2 6.271469e-16
30 106 22 146 13 7.266746e-25
31 107 23 146 12 1.382328e-25
Lots of other aggregation techniques are covered here. I'd look at the data.table package if your data is large.
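For instance, a rough data.table equivalent of the same aggregation (a sketch, again assuming the list of matrices is named X):
library(data.table)

# stack the matrices and sum p-values per key combination
XDT <- rbindlist(lapply(X, as.data.frame))
XDT[, .(sump = sum(p_value)), by = .(cluster_size, start, end, number)]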
