Reading fixed width format data into R with entries exceeding column width

I need to use the Annual Building Permits by Metropolitan Area Data distributed by the US Census Bureau, which are downloadable here as fixed width format text files. Here is an excerpt of the file (I've stripped the column names, as they aren't in a nice format and can be replaced after reading the file into a data frame):
999 10180 Abilene, TX 306 298 8 0 0 0
184 10420 Akron, OH 909 905 0 4 0 0
999 13980 Blacksburg-Christiansburg-Radford,
VA 543 455 0 4 84 3
145 14010 Bloomington, IL 342 214 4 0 124 7
160 15380 Buffalo-Cheektowaga-Niagara Falls,*
NY 1964 931 14 14 1005 68
268 15500 Burlington, NC 1353 938 12 16 387 20
As seen in the excerpt above, many of the entries in the Name column exceed the width of the column (which looks to be 36 characters). I've experimented with the various fwf reading functions of both the utils package and readr, but can't find a solution that takes these entries into account. Any tips would be much appreciated.
Edit: The original file excerpt was edited by a mod for formatting, and in the process the example entries where the third column's width was exceeded were deleted. I've since updated the excerpt to re-include them and have stripped the column names.
I ran @markdly's code, which was submitted before this edit; it works for all the entries that don't have this issue. I exported the result to a csv and have included an excerpt below to show what happens with the problem entries:
"38","999",NA,"13980",NA,"Blacksburg-Christiansburg-Radford,",NA,NA,NA,NA,NA,NA
"39","V","A",NA,NA,NA,"543",455,0,4,84,3
"40","145",NA,"14010",NA,"Bloomington, IL","342",214,4,0,124,7
"51","160",NA,"15380",NA,"Buffalo-Cheektowaga-Niagara Falls,*",NA,NA,NA,NA,NA,NA
"52","N","Y",NA,NA,NA,"1964",931,14,14,1005,68
"53","268",NA,"15500",NA,"Burlington, NC","1353",938,12,16,387,20
Edit 2: Most of the major metro areas I'm actually looking at don't fall into this problem category. So while it would be nice to have the data for the ones that do, if there is no workable solution, is there a way to remove these entries from the data set altogether?

Edit:
Based on the updated information, the files are not fixed width for some records. In this situation, I think readr::read_table is more useful than read_fwf. The following example is a tidyverse approach to importing and processing one of the source files (tb3u2016.txt). A base approach might involve using something like readLines.
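For instance, a minimal readLines sketch of that base idea (an assumption-laden sketch, not from the original answer: it presumes, as in the excerpt, that full records start with digits while continuation lines don't, and that blank or footer lines would need extra handling):
lns <- readLines("tb3u2016.txt")
lns <- lns[-(1:11)]                 # drop the report header, like skip = 11 below
cont <- !grepl("^\\s*[0-9]", lns)   # continuation lines don't begin with a CSA code
# paste each continuation line onto the record it belongs to, then drop it
lns[which(cont) - 1] <- paste(lns[which(cont) - 1], trimws(lns[cont]))
lns <- lns[!cont]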
Step 1: Read the file in and assign the split records a common record id
library(tidyverse)

df <- read_table("tb3u2016.txt", col_names = FALSE, skip = 11) %>%
  rownames_to_column() %>%
  mutate(record = if_else(lag(is.na(X2) & rowname > 1), lag(rowname), rowname))
df[37:40, ]
#> # A tibble: 4 x 8
#> rowname X1 X2
#> <chr> <chr> <int>
#> 1 37 999 13900 Bismarck, ND 856 629
#> 2 38 999 13980 Blacksburg-Christiansburg-Radford, NA
#> 3 39 VA 543 455
#> 4 40 145 14010 Bloomington, IL 342 214
#> # ... with 5 more variables: X3 <int>, X4 <int>, X5 <int>, X6 <int>,
#> # record <chr>
Step 2: Combine the split record text, then put the contents into separate variables using tidyr::extract. Trim whitespace and remove the redundant records.
df <- df %>%
  mutate(new_X1 = if_else(rowname != record, paste0(lag(X1), X1), X1)) %>%
  extract(new_X1, c("CSA", "CBSA", "Name", "Total"), "([0-9]+) ([0-9]+) (.+) ([0-9]+)") %>%
  mutate(Name = trimws(Name)) %>%
  filter((lead(record) != record) | rowname == 1) %>%
  select(CSA, CBSA, Name, Total, X2, X3, X4, X5, X6)
df[37:39, ]
#> # A tibble: 3 x 9
#> CSA CBSA Name Total X2 X3 X4
#> <chr> <chr> <chr> <chr> <int> <int> <int>
#> 1 999 13900 Bismarck, ND 856 629 16 6
#> 2 999 13980 Blacksburg-Christiansburg-Radford,VA 543 455 0 4
#> 3 145 14010 Bloomington, IL 342 214 4 0
#> # ... with 2 more variables: X5 <int>, X6 <int>
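Note that tidyr::extract returns character columns by default, which is why Total prints as <chr> above. Two hedged options if integer counts are wanted: pass convert = TRUE to extract(), or convert afterwards:
# convert the extracted count column to integer (sketch)
df <- df %>% mutate(Total = as.integer(Total))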
Below is a condensed version of the solution provided for an earlier version of the question, using readr::read_fwf.
Example data
library(readr)
# example data
txt <- " Num of
Struc-
tures
With
3 and 4 5 Units 5 Units
CSA CBSA Name Total 1 Unit 2 Units Units or more or more
999 10180 Abilene, TX 306 298 8 0 0 0
184 10420 Akron, OH 909 905 0 4 0 0"
write_file(txt, "example.txt")
Solution
col_widths <- c(3, 1, 5, 1, 36, 8, 8, 8, 8, 8, NA)
col_names <- c("CSA", "blank_1", "CBSA", "blank_2", "Name", "Total", "units_1", "units_2",
               "units_3_and_4", "units_5_or_more", "num_struc_5_or_more")

df <- read_fwf("example.txt", fwf_widths(col_widths, col_names), skip = 7)
df
#> # A tibble: 2 x 11
#> CSA blank_1 CBSA blank_2 Name Total units_1 units_2
#> <int> <chr> <int> <chr> <chr> <int> <int> <int>
#> 1 999 <NA> 10180 <NA> Abilene, TX 306 298 8
#> 2 184 <NA> 10420 <NA> Akron, OH 909 905 0
#> # ... with 3 more variables: units_3_and_4 <int>, units_5_or_more <int>,
#> # num_struc_5_or_more <int>
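If the column widths aren't known up front, readr can also guess them from runs of blank space. A sketch against the same example file (fwf_empty() is part of readr; the guessed columns may not line up with the hand-specified widths above, so the default X1..Xn names are kept here):
df_guess <- read_fwf("example.txt", fwf_empty("example.txt", skip = 7), skip = 7)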

Related

Force the application of the right as_tibble function on an object in a purrr::map call in R

I am trying to use the fact that as_tibble is a generic function so that I can process a column containing one of two types of objects in the same way.
The list column temp can store either a list or an object of class AsspDataObj.
I can define an as_tibble method for that class:
library(dplyr)

as_tibble.AsspDataObj <- function(x, field = 1, prefix = NULL, na.zeros = TRUE) {
  df <- data.frame(x[[field]])
  if (is.null(prefix)) {
    if (is.numeric(field)) {
      prefix <- names(x)[field]
    } else {
      prefix <- field
    }
  }
  colnames(df) <- paste(prefix, seq(1, ncol(df), 1), sep = "_")
  times <- seq(from = attr(x, "startTime"),
               by = 1 / attr(x, "sampleRate"),
               length.out = nrow(df))
  out <-
    tibble(times_orig = times,
           times_rel = seq(from = 0,
                           to = (attr(x, "endRecord") - 1) * 1000 / attr(x, "sampleRate"),
                           by = 1000 / attr(x, "sampleRate")),
           times_norm = times_rel / (max(times_rel) - min(times_rel))) %>%
    dplyr::bind_cols(df)
  if (na.zeros) {
    out <- out %>%
      dplyr::mutate(across(!times_orig & !times_rel & !times_norm, ~ na_if(., 0)))
  }
  return(out)
}
and then apply the function to one of the stored objects and get the expected result.
> class(pluck(illustration, "temp",1))
[1] "AsspDataObj"
> as_tibble(pluck(illustration, "temp",1))
# A tibble: 581 × 7
times_orig times_rel times_norm fm_1 fm_2 fm_3 fm_4
<dbl> <dbl> <dbl> <int> <int> <int> <int>
1 0.0025 0 0 NA 1062 2073 3156
2 0.0075 5 0.00172 1239 2109 3113 4247
3 0.0125 10 0.00345 NA 1352 2316 3310
4 0.0175 15 0.00517 NA 1448 2555 3870
5 0.0225 20 0.00690 NA 1438 2564 3958
[...]
Now, I want to apply the function to each object, and expand (unnest) the output so that result is actually the as_tibble output columns for each stored AsspDataObj, but joined with the other columns in the illustration data set.
Now, if I do this (simplifying the output a bit by removing some columns just for now)
> illustration %>% mutate(out = map(.x = temp,.f=as_tibble)) %>% select(sl_rowIdx,out) %>% unnest(out)
I get this output
# A tibble: 1,162 × 10
sl_rowIdx frame_time fm1 fm2 fm3 fm4 bw1 bw2 bw3 bw4
<int> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl>
1 1 2.5 0 1062 2073 3156 0 800 369 890
2 1 7.5 1239 2109 3113 4247 644 493 792 859
3 1 12.5 0 1352 2316 3310 0 486 762 933
4 1 17.5 0 1448 2555 3870 0 577 716 1442
5 1 22.5 0 1438 2564 3958 0 548 552 1062
6 1 27.5 0 1343 2518 4001 0 637 617 1096
which is not the output I would expect to see if the as_tibble generic above had been applied. Instead, it seems that some other as_tibble function has been applied to the object.
Why? How do I find out what as_tibble.* function has been applied here? How do I force the application of my definition?
You need a data excerpt: https://umeauniversity-my.sharepoint.com/:u:/g/personal/frkkan96_ad_umu_se/ET_56GqikHxAuFc60y_ce1UBL0lOOJcqWaMDjDwLOxjuOw?e=IsEZPf
Sorry, I found the answer. The problem was in R's S3 method dispatch (I think), with a (not exported) definition of as_tibble.AsspDataObj defined in the package where AsspDataObj itself is defined
https://github.com/IPS-LMU/wrassp/blob/cbbc6e9fe100f5f32f7b30510f3008c5a5553440/R/AsspDataObj.R#L357
being called instead of my function.
Confusing.
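Two ways to diagnose and work around a clash like this (a sketch, not from the original thread; sloop is an extra package, and calling your method by its full name bypasses dispatch entirely):
# show which as_tibble method S3 dispatch actually picks
sloop::s3_dispatch(as_tibble(pluck(illustration, "temp", 1)))
# inspect the registered method that won
getS3method("as_tibble", "AsspDataObj")
# force your own definition by calling it directly inside map()
illustration %>% mutate(out = map(temp, ~ as_tibble.AsspDataObj(.x)))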

Percentages over Multiple Columns with two Factors

I'm trying to get relative proportions of tallies that belong to two separate categories. This is an example of the raw file.
A tibble: 8 x 5
resp euRefVoteW1 euRefVoteW2 euRefVoteW3 Paper
<fct> <int> <int> <int> <fct>
1 Remain 316 290 313 Times
2 Leave 157 123 159 Times
3 Will Not Vote 2 3 3 Times
4 Don't Know 56 51 55 Times
5 Remain 190 175 199 Telegraph
6 Leave 339 282 334 Telegraph
7 Will Not Vote 4 3 4 Telegraph
8 Don't Know 70 62 69 Telegraph
It is a tally of two different factors. I'm trying to convert the tally of responses into percentages so it would look something like this:
A tibble: 8 x 5
resp euRefVoteW1 euRefVoteW2 euRefVoteW3 Paper
1 Remain 52% 53% .. Times
2 Leave 43% 42% .. Times
3 Will Not Vote 1% 2% . Times
4 Don't Know 4% 3% . Times
5 Remain 35% 35% . Telegraph
6 Leave 52% 52% . Telegraph
7 Will Not Vote 2% 2% . Telegraph
8 Don't Know 11% 11% . Telegraph
(Obviously these numbers aren't correct, but I hope it shows that each 4 x 1 section should sum to 100%.)
The dataframe is already in a format similar to a table, so is there a way to apply the prop.table method to the df? When I tried it like this, it fails because the df is not a clean array. Is there a way around this?
for_stack <- combined_tallies %>%
  group_by(Paper, resp) %>%
  prop.table(margin = 2)
Here is an rds copy of the dataframe if this helps!
The best answers I could find elsewhere on SO (Percentage of factor levels by group in R) were of no use.
I have recreated your data set using dput(), which you are encouraged to use to provide reproducible data on Stack Overflow.
votes <- structure(list(
  resp = c("Remain", "Leave", "Will Not Vote", "Don’t Know",
           "Remain", "Leave", "Will Not Vote", "Don’t Know"),
  ref1 = c(316, 157, 2, 56, 190, 339, 4, 70),
  ref2 = c(290, 123, 3, 51, 175, 282, 3, 62),
  ref3 = c(313, 159, 3, 55, 199, 334, 4, 69),
  paper = c("Times", "Times", "Times", "Times",
            "Telegraph", "Telegraph", "Telegraph", "Telegraph")),
  .Names = c("resp", "ref1", "ref2", "ref3", "paper"),
  row.names = c(NA, -8L),
  class = c("tbl_df", "tbl", "data.frame"))
An alternative approach is to change the structure of your data set before performing the analysis. You are trying to create relative values not across entire columns or rows but within subsets. One way around this is to pivot the data to long format with the tidyverse, calculate the percentages there, and revert to the original structure afterwards.
library(tidyverse)

vote_long <- votes %>%
  pivot_longer(cols = c(ref1, ref2, ref3), names_to = "ref", values_to = "votes")
vote_long
# A tibble: 24 x 4
resp paper ref votes
<chr> <chr> <chr> <dbl>
1 Remain Times ref1 316
2 Remain Times ref2 290
3 Remain Times ref3 313
4 Leave Times ref1 157
5 Leave Times ref2 123
6 Leave Times ref3 159
7 Will Not Vote Times ref1 2
8 Will Not Vote Times ref2 3
9 Will Not Vote Times ref3 3
10 Don’t Know Times ref1 56
# … with 14 more rows
# create grouped relative values
vote_long_relative <- vote_long %>%
  group_by(paper, ref) %>%
  mutate(rel_votes = votes / sum(votes) * 100)

vote_wide_relative <- vote_long_relative %>%
  select(-votes) %>%
  pivot_wider(id_cols = c(resp, paper), names_from = "ref", values_from = "rel_votes")
vote_wide_relative
# A tibble: 8 x 5
# Groups: paper [2]
resp paper ref1 ref2 ref3
<chr> <chr> <dbl> <dbl> <dbl>
1 Remain Times 59.5 62.1 59.1
2 Leave Times 29.6 26.3 30
3 Will Not Vote Times 0.377 0.642 0.566
4 Don’t Know Times 10.5 10.9 10.4
5 Remain Telegraph 31.5 33.5 32.8
6 Leave Telegraph 56.2 54.0 55.1
7 Will Not Vote Telegraph 0.663 0.575 0.660
8 Don’t Know Telegraph 11.6 11.9 11.4
Maybe this is what you are looking for:
library(tidyverse)

combined_tallies %>%
  group_by(Paper) %>%
  mutate(across(where(is.numeric), ~ .x / sum(.x, na.rm = TRUE) * 100))
# A tibble: 20 x 10
# Groups: Paper [5]
resp euRefVoteW1 euRefVoteW2 euRefVoteW3 euRefVoteW4 euRefVoteW6 euRefVoteW7 euRefVoteW8
<fct> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl>
1 Rema~ 59.5 62.1 59.1 61.0 63.7 60.3 61.2
2 Leave 29.6 26.3 30 29.0 25.2 35.6 35.2
3 Will~ 0.377 0.642 0.566 0.565 0.377 0.377 0.377
4 Don'~ 10.5 10.9 10.4 9.42 10.7 3.77 3.20
...
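For completeness, the prop.table idea from the question can be approximated in base R without reshaping, using ave() for the per-group column totals (a sketch against the votes data above, not prop.table itself, since prop.table needs a clean array):
votes_pct <- votes
votes_pct[c("ref1", "ref2", "ref3")] <- lapply(
  votes[c("ref1", "ref2", "ref3")],
  function(x) 100 * x / ave(x, votes$paper, FUN = sum)  # divide by each paper's column total
)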

Import CSV multiple ranges and headers

I am trying to import the following file, which has two repeated kinds of data sections to extract. The first section begins with an unused header (line 5) and a real header beginning with "ES" (line 6). The next section begins with an unused header (line 13) and a real header beginning with "LU" (line 14), followed by more variable names. There are many of these files, and each contains different numbers of ES and LU sections of varying length. I need to extract the ES and LU data into separate dataframes. Unfortunately, the files are "as is" off of a sensor array and I can't change that, and I would prefer not to do all this in Excel, but I may have to. In the real files there may be hundreds of rows in each ES and LU section.
I have tried to adapt the following code to index the ES sections; I would then have extracted and cleaned that data and done the same for the LU sections, but I could not get even this to work. Part of the reason is that "ES" appears in both header rows. I did see code using Perl scripts but have never used that language.
lns <- readLines("lake1.txt")
idx <- grepl("ES", lns)
df <- read.table(text = lns[!idx])
wd <- diff(c(which(idx), length(idx) + 1)) - 1
df$label <- rep(lns[idx], wd)
I was not sure the best way to add a CSV file example but here it is...
Garbage Text 1,,,,,,,,
Garbage Text 2,,,,,,,,
Garbage Text 3,,,,,,,,
,,,,,,,,
INTTIME ('sec'),SAMPLE ('sec'),ES_DARK ('uW/cm^2/nm'),ES_DARK ('uW/cm^2/nm'),ES_DARK ('uW/cm^2/nm'),CHECK (''),DATETAG (NONE),TIMETAG2 (NONE),POSFRAME (NONE)
ES,DELAY,344.83,348.23,351.62,SUM,NONE,NONE,COUNTS
0.032,0,0.35441789,-0.00060208,0.10290995,87,2017015,10:42:39,1
0.032,0,-0.36023974,-0.22242269,-0.09639,109,2017015,10:42:40,10
0.032,0,0.07552711,0.01524224,-0.16756855,91,2017015,10:42:48,41
,,,,,,,,11304
,,,,,,,,11312
,,,,,,,,
INTTIME ('sec'),SAMPLE ('sec'),LU ('uW/cm^2/nm/sr'),LU ('uW/cm^2/nm/sr'),LU ('uW/cm^2/nm/sr'),CHECK (''),DATETAG (NONE),TIMETAG2 (NONE),POSFRAME (NONE)
LU,DELAY,344.37,347.75,351.13,SUM,NONE,NONE,COUNTS
0.032,0,0.02288441,0.02891912,0.03595322,53,2017015,10:42:38,2
0.032,0,-0.00014323,0.00024047,0.00001585,212,2017015,10:42:38,6
0.032,0,0.00114258,0.00091736,-0.0000495,16,2017015,10:42:39,9
0.032,0,0.00020744,0.0004186,0.00027721,118,2017015,10:42:40,16
,,,,,,,,11310
,,,,,,,,
INTTIME ('sec'),SAMPLE ('sec'),ES ('uW/cm^2/nm'),ES ('uW/cm^2/nm'),ES ('uW/cm^2/nm'),CHECK (''),DATETAG (NONE),TIMETAG2 (NONE),POSFRAME (NONE)
ES,DELAY,344.83,348.23,351.62,SUM,NONE,NONE,COUNTS
0.032,0,56.7600789,59.43147464,62.83968564,186,2017015,10:42:38,3
0.032,0,56.27202003,59.52654061,62.86815706,29,2017015,10:42:38,4
,,,,,,,,11309
,,,,,,,,11311
,,,,,,,,
INTTIME ('sec'),SAMPLE ('sec'),LU ('uW/cm^2/nm/sr'),LU ('uW/cm^2/nm/sr'),LU ('uW/cm^2/nm/sr'),CHECK (''),DATETAG (NONE),TIMETAG2 (NONE),POSFRAME (NONE)
LU,DELAY,344.37,347.75,351.13,SUM,NONE,NONE,COUNTS
0.032,0,-0.00011611,-0.00039544,-0.00014584,3,2017015,10:42:42,20
0.032,0,-0.00032394,-0.00020563,-0.00020383,229,2017015,10:42:46,39
This is what the two dataframes should look like in the end:
Dataframe 1
ES,DELAY,344.83,348.23,351.62,SUM,NONE,NONE,COUNTS
0.032,0,0.35441789,-0.00060208,0.10290995,87,2017015,10:42:39,1
0.032,0,-0.36023974,-0.22242269,-0.09639,109,2017015,10:42:40,10
0.032,0,0.07552711,0.01524224,-0.16756855,91,2017015,10:42:48,41
0.032,0,56.7600789,59.43147464,62.83968564,186,2017015,10:42:38,3
0.032,0,56.27202003,59.52654061,62.86815706,29,2017015,10:42:38,4
Dataframe 2
LU,DELAY,344.37,347.75,351.13,SUM,NONE,NONE,COUNTS
0.032,0,0.02288441,0.02891912,0.03595322,53,2017015,10:42:38,2
0.032,0,-0.00014323,0.00024047,0.00001585,212,2017015,10:42:38,6
0.032,0,0.00114258,0.00091736,-0.0000495,16,2017015,10:42:39,9
0.032,0,0.00020744,0.0004186,0.00027721,118,2017015,10:42:40,16
0.032,0,-0.00011611,-0.00039544,-0.00014584,3,2017015,10:42:42,20
0.032,0,-0.00032394,-0.00020563,-0.00020383,229,2017015,10:42:46,39
Here is a way you can solve this with tidyverse tools.
readr for reading/writing csv files
dplyr for data frame manipulation
stringr for string manipulation
library(readr)
library(dplyr)
library(stringr)
df_1 <- read_csv("test1.csv", col_names = FALSE, col_types = cols(.default = "c"), skip = 3)
First remove the rows where every value (or every value but the last) is missing, along with the rows containing the extra INTTIME headers.
Then create a new column holding the ES or LU marker (NA otherwise), and use tidyr::fill to fill those values down.
Then change the two columns labelled NONE to DATE and TIME, because later we don't want two columns with the same name.
df_2 <- df_1 %>%
  filter(!is.na(X1), !str_detect(X1, "INTTIME")) %>%
  mutate(grp = if_else(X1 %in% c("ES", "LU"), X1, NA_character_)) %>%
  tidyr::fill(grp, .direction = "down") %>%
  mutate(X7 = str_replace(X7, "NONE", "DATE"),
         X8 = str_replace(X8, "NONE", "TIME"))
df_2
#> # A tibble: 15 x 10
#> X1 X2 X3 X4 X5 X6 X7 X8 X9 grp
#> * <chr> <chr> <chr> <chr> <chr> <chr> <chr> <chr> <chr> <chr>
#> 1 ES DELAY 344.83 348.23 351.62 SUM DATE TIME COUNTS ES
#> 2 0.032 0 0.35441789 -0.00060208 0.10290995 87 2017015 10:42:39 1 ES
#> 3 0.032 0 -0.36023974 -0.22242269 -0.09639 109 2017015 10:42:40 10 ES
#> 4 0.032 0 0.07552711 0.01524224 -0.16756855 91 2017015 10:42:48 41 ES
#> 5 LU DELAY 344.37 347.75 351.13 SUM DATE TIME COUNTS LU
#> 6 0.032 0 0.02288441 0.02891912 0.03595322 53 2017015 10:42:38 2 LU
#> 7 0.032 0 -0.00014323 0.00024047 0.00001585 212 2017015 10:42:38 6 LU
#> 8 0.032 0 0.00114258 0.00091736 -0.0000495 16 2017015 10:42:39 9 LU
#> 9 0.032 0 0.00020744 0.0004186 0.00027721 118 2017015 10:42:40 16 LU
#> 10 ES DELAY 344.83 348.23 351.62 SUM DATE TIME COUNTS ES
#> 11 0.032 0 56.7600789 59.43147464 62.83968564 186 2017015 10:42:38 3 ES
#> 12 0.032 0 56.27202003 59.52654061 62.86815706 29 2017015 10:42:38 4 ES
#> 13 LU DELAY 344.37 347.75 351.13 SUM DATE TIME COUNTS LU
#> 14 0.032 0 -0.00011611 -0.00039544 -0.00014584 3 2017015 10:42:42 20 LU
#> 15 0.032 0 -0.00032394 -0.00020563 -0.00020383 229 2017015 10:42:46 39 LU
Now, for each of ES and LU, you can filter to just those records, drop the new grp column, use the first row as the column names, remove the column-header rows, and write the result to a cleaned csv file.
df_es <- df_2 %>%
  filter(grp == "ES") %>%
  select(-grp) %>%
  purrr::set_names(., .[1, ]) %>%
  filter(ES != "ES") %>%
  write_csv("ES.csv")
df_es
#> # A tibble: 5 x 9
#> ES DELAY `344.83` `348.23` `351.62` SUM DATE TIME COUNTS
#> * <chr> <chr> <chr> <chr> <chr> <chr> <chr> <chr> <chr>
#> 1 0.032 0 0.35441789 -0.00060208 0.10290995 87 2017015 10:42:39 1
#> 2 0.032 0 -0.36023974 -0.22242269 -0.09639 109 2017015 10:42:40 10
#> 3 0.032 0 0.07552711 0.01524224 -0.16756855 91 2017015 10:42:48 41
#> 4 0.032 0 56.7600789 59.43147464 62.83968564 186 2017015 10:42:38 3
#> 5 0.032 0 56.27202003 59.52654061 62.86815706 29 2017015 10:42:38 4
df_lu <- df_2 %>%
  filter(grp == "LU") %>%
  select(-grp) %>%
  purrr::set_names(., .[1, ]) %>%
  filter(LU != "LU") %>%
  write_csv("LU.csv")
df_lu
#> # A tibble: 6 x 9
#> LU DELAY `344.37` `347.75` `351.13` SUM DATE TIME COUNTS
#> * <chr> <chr> <chr> <chr> <chr> <chr> <chr> <chr> <chr>
#> 1 0.032 0 0.02288441 0.02891912 0.03595322 53 2017015 10:42:38 2
#> 2 0.032 0 -0.00014323 0.00024047 0.00001585 212 2017015 10:42:38 6
#> 3 0.032 0 0.00114258 0.00091736 -0.0000495 16 2017015 10:42:39 9
#> 4 0.032 0 0.00020744 0.0004186 0.00027721 118 2017015 10:42:40 16
#> 5 0.032 0 -0.00011611 -0.00039544 -0.00014584 3 2017015 10:42:42 20
#> 6 0.032 0 -0.00032394 -0.00020563 -0.00020383 229 2017015 10:42:46 39
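Since there are many such files, the same pipeline can be wrapped up and mapped over a directory. A sketch, where clean_sensor_file() is a hypothetical wrapper around the read/filter/fill steps above that returns the cleaned tibble for one grp, and "sensor_data" is a hypothetical folder:
library(purrr)
files <- list.files("sensor_data", pattern = "\\.csv$", full.names = TRUE)
all_es <- map_dfr(files, clean_sensor_file, grp = "ES")  # stack ES records from every file
all_lu <- map_dfr(files, clean_sensor_file, grp = "LU")  # stack LU records from every file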

how to select data based on a list from a split data frame and then recombine in R

I am trying to do the following. I have a dataset Test:
Item_ID Test_No Category Sharpness Weight Viscocity
132 1 3 14.93199362 94.37250417 579.4236727
676 1 4 44.58750591 70.03232054 1829.170727
699 2 5 89.02760079 54.30587287 1169.226863
850 3 6 30.74535903 83.84377678 707.2280513
951 4 237 67.79568019 51.10388484 917.6609965
1031 5 56 74.06697003 63.31274502 1981.17804
1175 4 354 98.9656142 97.7523884 100.7357981
1483 5 726 9.958040999 51.29537311 1222.910211
1529 7 800 64.11430235 65.69780939 573.8266137
1698 9 125 67.83105185 96.53847341 486.9620194
1748 9 1005 49.43602318 52.9139591 1881.740184
2005 9 28 26.89821508 82.12663209 1709.556135
2111 2 76 83.03593144 85.23622731 276.5088502
I want to split this data based on Test_No and then compute the number of unique Category values per Test_No, as well as the median Category value. I chose to use split and sapply in the following way, but I am getting an error about a missing parenthesis. Is there anything wrong with my approach? Please find my code below:
function(CatRange){
c(Cat_Count = length(unique(CatRange$Category)), Median_Cat = median(unique(CatRange$Category), na.rm = TRUE) )
}
CatStat <- do.call(rbind,sapply(split(Test, Test$Test_No), function(ModRange)))
Appending to my question:
I want the output to contain the following information: Test_No, Category, Median_Cat and Cat_Count.
We can try with dplyr:
library(dplyr)

Test %>%
  group_by(Test_No) %>%
  summarise(Cat_Count = n_distinct(Category),
            Median_Cat = median(Category, na.rm = TRUE),
            Category = toString(Category))
# Test_No Cat_Count Median_Cat Category
# <int> <int> <dbl> <chr>
#1 1 2 3.5 3, 4
#2 2 2 40.5 5, 76
#3 3 1 6.0 6
#4 4 2 295.5 237, 354
#5 5 2 391.0 56, 726
#6 7 1 800.0 800
#7 9 3 125.0 125, 1005, 28
Or if you prefer base R, we can also try with aggregate:
aggregate(Category ~ Test_No, Test,
          function(x) c(Cat_Count = length(unique(x)),
                        Median_Cat = median(x, na.rm = TRUE),
                        Category = toString(x)))
As far as the function you wrote is concerned, there are a couple of syntax issues: the function is never assigned to a name, and the sapply call passes function(ModRange) with no body. A corrected version:
new_func <- function(CatRange){
  c(Cat_Count = length(unique(CatRange$Category)),
    Median_Cat = median(unique(CatRange$Category), na.rm = TRUE),
    Category = toString(CatRange$Category))
}
data.frame(t(sapply(split(Test, Test$Test_No), new_func)))
# Cat_Count Median_Cat Category
#1 2 3.5 3, 4
#2 2 40.5 5, 76
#3 1 6 6
#4 2 295.5 237, 354
#5 2 391 56, 726
#7 1 800 800
#9 3 125 125, 1005, 28

R: sum rows from column A until conditioned value in column B

I'm pretty new to R and can't seem to figure out how to deal with what seems to be a relatively simple problem. I want to sum the rows of the column 'DURATION' per 'TRIAL_INDEX', but only the initial run of rows in which the values of 'X_POSITION' are increasing; I only want to sum that first increasing run within each trial.
The first rows of a simplified dataframe:
TRIAL_INDEX DURATION X_POSITION
1 1 204 314.5
2 1 172 471.6
3 1 186 570.4
4 1 670 539.5
5 1 186 503.6
6 2 134 306.8
7 2 182 503.3
8 2 806 555.7
9 2 323 490.0
So, for TRIAL_INDEX 1, only the first three values of DURATION should be added (204+172+186), as this is where X has the highest value so far (going through the dataframe row by row).
The desired output should look something like:
TRIAL_INDEX DURATION X_POSITION FIRST_PASS_TIME
1 1 204 314.5 562
2 1 172 471.6 562
3 1 186 570.4 562
4 1 670 539.5 562
5 1 186 503.6 562
6 2 134 306.8 1122
7 2 182 503.3 1122
8 2 806 555.7 1122
9 2 323 490.0 1122
I tried to use dplyr to generate a new dataframe that could be merged with my original dataframe.
However, the code doesn't work, and I'm also not sure how to restrict the sum to only the first rows per trial with increasing values of X_POSITION.
FirstPassRT = dat %>%
  group_by(TRIAL_INDEX) %>%
  filter(dplyr::lag(dat$X_POSITION, 1) > dat$X_POSITION) %>%
  summarise(FIRST_PASS_TIME = sum(DURATION))
Any help and suggestions are greatly appreciated!
library(data.table)
dt = as.data.table(df) # or setDT to convert in place
# find the rows that will be used for summing DURATION
idx = dt[, .I[1]:.I[min(.N, which(diff(X_POSITION) < 0), na.rm = T)], by = TRIAL_INDEX]$V1
# sum the DURATION for those rows
dt[idx, time := sum(DURATION), by = TRIAL_INDEX][, time := time[1], by = TRIAL_INDEX]
dt
# TRIAL_INDEX DURATION X_POSITION time
#1: 1 204 314.5 562
#2: 1 172 471.6 562
#3: 1 186 570.4 562
#4: 1 670 539.5 562
#5: 1 186 503.6 562
#6: 2 134 306.8 1122
#7: 2 182 503.3 1122
#8: 2 806 555.7 1122
#9: 2 323 490.0 1122
Here is something you can try with the dplyr package:
library(dplyr)

dat %>%
  group_by(TRIAL_INDEX) %>%
  mutate(IncLogic = X_POSITION > lag(X_POSITION, default = 0)) %>%
  mutate(FIRST_PASS_TIME = sum(DURATION[IncLogic])) %>%
  select(-IncLogic)
Source: local data frame [9 x 4]
Groups: TRIAL_INDEX [2]
TRIAL_INDEX DURATION X_POSITION FIRST_PASS_TIME
(int) (int) (dbl) (int)
1 1 204 314.5 562
2 1 172 471.6 562
3 1 186 570.4 562
4 1 670 539.5 562
5 1 186 503.6 562
6 2 134 306.8 1122
7 2 182 503.3 1122
8 2 806 555.7 1122
9 2 323 490.0 1122
If you want to summarize it down to one row per trial you can use summarize like this:
library(dplyr)

df <- data_frame(TRIAL_INDEX = c(1, 1, 1, 1, 1, 2, 2, 2, 2),
                 DURATION = c(204, 172, 186, 670, 186, 134, 182, 806, 323),
                 X_POSITION = c(314.5, 471.6, 570.4, 539.5, 503.6, 306.8, 503.3, 555.7, 490.0))

res <- df %>%
  group_by(TRIAL_INDEX) %>%
  mutate(x.increasing = ifelse(X_POSITION > lag(X_POSITION), TRUE, FALSE),
         x.increasing = ifelse(is.na(x.increasing), TRUE, x.increasing)) %>%
  filter(x.increasing == TRUE) %>%
  summarize(FIRST_PASS_TIME = sum(DURATION))
res
#Source: local data frame [2 x 2]
#
#  TRIAL_INDEX FIRST_PASS_TIME
#        (dbl)           (dbl)
#1           1             562
#2           2            1122
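The same "first increasing run" logic can also be written in base R (a sketch; fp() is a helper introduced here, not from the answers above):
# sum DURATION up to the last row before X_POSITION first drops
fp <- function(dur, x) {
  stop_at <- which(diff(x) <= 0)[1]               # first non-increase, NA if none
  n <- if (is.na(stop_at)) length(x) else stop_at
  sum(dur[seq_len(n)])
}
df$FIRST_PASS_TIME <- with(df, ave(seq_along(DURATION), TRIAL_INDEX,
                                   FUN = function(i) fp(DURATION[i], X_POSITION[i])))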
