How to reshape a matrix and fill missing values with 0 in R

I have a question about reshaping data in R. I need to pivot the data to wide format, combining the Phase and Status columns into the new column names and filling missing values with 0. It seems tricky, and I would appreciate any help. Thank you.
My data currently looks like this:
structure(list(Customer = c("1096261", "1096261", "1169502",
"1169502"), Phase = c("2", "3", "1", "2"), Status = c("Ontime",
"Ontime", "Ontime", "Ontime"), Amount = c(21216.32, 42432.65,
200320.05, 84509.24)), .Names = c("Customer", "Phase", "Status",
"Amount"), row.names = c(NA, -4L), class = c("grouped_df", "tbl_df",
"tbl", "data.frame"), vars = c("Customer", "Phase"), drop = TRUE, indices
= list(
0L, 1L, 2L, 3L), group_sizes = c(1L, 1L, 1L, 1L), biggest_group_size = 1L,
labels = structure(list(
Customer = c("1096261", "1096261", "1169502", "1169502"),
Phase = c("2", "3", "1", "2")), row.names = c(NA, -4L), class =
"data.frame", vars = c("Customer",
"Phase"), drop = TRUE, .Names = c("Customer", "Phase")))
I need the reshaped data to have the following columns:
Customer, Phase1earlyTotal, Phase2earlyTotal, ..., Phase4earlyTotal, Phase1_Ontimetotal, ..., Phase4_Ontimetotal, Phase1LateTotal, ..., Phase4LateTotal. For example, Phase1earlyTotal is the sum of Amount over the rows with Phase = 1 and Status = Early.
I currently use the following script, which does not work because I don't know how to combine the Phase and Status columns.
mydata2<-data.table(mydata2,V3,V4)
mydata2$V4<-NULL
datacus <- data.frame(mydata2[-1,],stringsAsFactors = F);
datacus <- datacus %>% mutate(Phase= as.numeric(Phase),Amount=
as.numeric(Amount)) %>%
complete(Phase = 1:4,fill= list(Amount = 0)) %>%
dcast(datacus~V3, value.var = 'Amount',fill = 0) %>% select(Phase, V3)
%>%t()

I believe you are looking for something like this?
sample data
df <- structure(list(Customer = c("1096261", "1096261", "1169502",
"1169502"), Phase = c("2", "3", "1", "2"), Status = c("Ontime",
"Ontime", "Ontime", "Ontime"), Amount = c(21216.32, 42432.65,
200320.05, 84509.24)), .Names = c("Customer", "Phase", "Status",
"Amount"), row.names = c(NA, -4L), class = c("grouped_df", "tbl_df",
"tbl", "data.frame"), vars = c("Customer", "Phase"), drop = TRUE, indices
= list(
0L, 1L, 2L, 3L), group_sizes = c(1L, 1L, 1L, 1L), biggest_group_size = 1L,
labels = structure(list(
Customer = c("1096261", "1096261", "1169502", "1169502"),
Phase = c("2", "3", "1", "2")), row.names = c(NA, -4L), class =
"data.frame", vars = c("Customer",
"Phase"), drop = TRUE, .Names = c("Customer", "Phase")))
# Customer Phase Status Amount
# 1: 1096261 2 Ontime 21216.32
# 2: 1096261 3 Ontime 42432.65
# 3: 1169502 1 Ontime 200320.05
# 4: 1169502 2 Ontime 84509.24
code
library( data.table )
dcast( setDT( df ), Customer ~ Phase + Status, fun.aggregate = sum, value.var = "Amount" )[]
output
# Customer 1_Ontime 2_Ontime 3_Ontime
# 1: 1096261 0 21216.32 42432.65
# 2: 1169502 200320 84509.24 0.00
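The output above only contains the Phase/Status combinations that occur in the data. If every combination should appear as a column, as the column list in the question suggests, one possible base R sketch is to build the full set of column keys up front and let xtabs() fill the absent ones with 0. The phases 1 to 4, the status labels Early/Ontime/Late, and the PhaseN_Status naming are assumptions taken from the wording of the question, not a confirmed specification.

```r
# Sample data from the question, rebuilt as a plain data frame
df <- data.frame(
  Customer = c("1096261", "1096261", "1169502", "1169502"),
  Phase    = c("2", "3", "1", "2"),
  Status   = rep("Ontime", 4),
  Amount   = c(21216.32, 42432.65, 200320.05, 84509.24),
  stringsAsFactors = FALSE
)

# Assumed full set of columns: 4 phases x 3 statuses
all_keys <- as.vector(outer(paste0("Phase", 1:4),
                            c("Early", "Ontime", "Late"),
                            paste, sep = "_"))

# A factor with all levels makes xtabs() keep the empty combinations
df$key <- factor(paste0("Phase", df$Phase, "_", df$Status), levels = all_keys)

# Sum Amount per Customer and key; combinations with no rows become 0
wide <- as.data.frame.matrix(xtabs(Amount ~ Customer + key, data = df))
wide <- cbind(Customer = rownames(wide), wide)
print(wide)
```

The result has one row per customer and 12 amount columns, whether or not a given combination occurs in the data.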


Pick the corresponding row position and name

I have a list of values and their corresponding row position provided as id. Essentially, given a vector of names I want to grab the row position from the list and assign the name it's from as a column. I can achieve the first part, but I cannot assign the names accordingly.
For example
predictors <- c('status', 'verbal')
row_predictor_value <-
lapply(row_predictor_data, function(x)
which(x$name %in% predictors, arr.ind = TRUE) %>% setNames(., predictors))
Produces the following result:
[[1]]
status verbal
2 4
[[2]]
status verbal
2 4
[[3]]
status verbal
2 4
However, this assigns the names in the fixed order of predictors rather than in the order in which they occur in each list element.
It should instead produce:
[[1]]
status verbal
2 4
[[2]]
verbal status
2 4
[[3]]
status verbal
2 4
Here's some example data:
row_predictor_data <- list(structure(list(id = structure(1:4, .Label = c("1", "2",
"3", "4"), class = "factor"), name = structure(c(1L, 3L, 2L,
4L), .Label = c("(Intercept)", "income", "status", "verbal"), class = "factor")), class = "data.frame", row.names = c(NA,
-4L)), structure(list(id = structure(1:4, .Label = c("1", "2",
"3", "4"), class = "factor"), name = structure(c(1L, 4L, 2L,
3L), .Label = c("(Intercept)", "sex", "status", "verbal"), class = "factor")), class = "data.frame", row.names = c(NA,
-4L)), structure(list(id = structure(1:5, .Label = c("1", "2",
"3", "4", "5"), class = "factor"), name = structure(c(1L, 4L,
2L, 5L, 3L), .Label = c("(Intercept)", "income", "sex", "status",
"verbal"), class = "factor")), class = "data.frame", row.names = c(NA,
-5L)))
The issue with which and %in% is that they return the matching positions in row order, regardless of the order of 'predictors'. When the order differs, as in the second list element, assigning the fixed 'predictors' vector as names gives the wrong result. Instead, use the 'name' column to generate the names of the vector:
lapply(row_predictor_data, function(x)
with(subset(x, name %in% predictors),
setNames(as.character(id), name)))
NOTE: Here we return a named vector of 'id's
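If the row positions themselves are wanted rather than the 'id' values, a variant along the same lines might look like the sketch below. The helper name named_positions and the small example list are hypothetical and only illustrate the idea.

```r
# Hypothetical helper: row positions of the predictors, named by the
# value actually found at each position
named_positions <- function(x, predictors) {
  pos <- which(x$name %in% predictors)
  setNames(pos, as.character(x$name[pos]))
}

# Two small data frames where the predictors occur in different orders
example <- list(
  data.frame(name = c("(Intercept)", "status", "income", "verbal")),
  data.frame(name = c("(Intercept)", "verbal", "sex", "status"))
)

result <- lapply(example, named_positions,
                 predictors = c("status", "verbal"))
print(result)
```

Because the names are looked up from the rows that matched, the second element is labelled verbal, status instead of inheriting the order of predictors.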

How do I change the names of columns in multiple dataframes using a mapping file in R?

I have a script that loops through multiple years of data, one year at a time. Each year of data consists of multiple dataframes that are placed in a list called all_input. At the beginning of the loop (after the data is read in), I am trying to get all of the years of data in the same format before the rest of the processing.
The issue I am having is with column names, which are not uniform.
There are 5 columns in each dataframe that I want to keep, and I want them to be called total_emissions, uom, tribal_name, st_usps_cd, and description. In some dataframes they already have these names, while in others they have various names such as pollutant.desc or pollutant_desc.
My current approach is this:
# Create a mapping file for the column names
header_map <- data.frame(orignal_col = c( "pollutant_desc", "pollutant.desc", "emissions.uom", "total.emissions", "tribal.name", "state" ),
new_col = c( "description", "description", "uom", "total_emissions", "tribal_name", "st_usps_cd" ), stringsAsFactors = FALSE)
# change the column names
lapply(all_input, function(x) {
names(x)[match(header_map$orignal_col, names(x))] <- header_map$new_col
x
}) -> all_input
Which creates a header mapping file that looks like this:
orignal_col new_col
pollutant_desc description
pollutant.desc description
emissions.uom uom
total.emissions total_emissions
tribal.name tribal_name
state st_usps_cd
The error I am getting is as follows:
Error in names(x)[match(header_map$orignal_col, names(x))] <- header_map$new_col :
NAs are not allowed in subscripted assignments
I understand that I will have to manually add entries to the header mapping as new years of data with different column names are processed, but how can I get this to work?
Fake sample data: df1 and df2 represent the format of the "2017" data, where multiple columns need name changes, but the current names are consistent between dataframes. df3 represents "2011" data, where all of the column names are as they should be. df4 represents "2014" data, where the only column that needs to be changed is pollutant_desc. Note that there are extra columns in each dataframe that are not needed and can be ignored. As a reminder, these dataframes are not all read at the same time. The loop is by year, so df1 and df2 (in the list all_input) will be formatted and processed. Then all of the data is removed, and a new all_input list is created with the next year's dataframes, which will have different column names. The code must work for all years without being changed.
> dput(df1)
structure(list(total.emissions = structure(1:2, .Label = c("100",
"300"), class = "factor"), emissions.uom = structure(1:2, .Label = c("LB",
"TON"), class = "factor"), international = c(TRUE, TRUE), hours = structure(2:1, .Label = c("17",
"3"), class = "factor"), tribal.name = structure(2:1, .Label = c("FLLK",
"SUWJG"), class = "factor"), state = structure(1:2, .Label = c("AK",
"MN"), class = "factor"), pollutant.desc = structure(1:2, .Label = c("Methane",
"NO2"), class = "factor"), policy = c(TRUE, FALSE)), class = "data.frame", row.names = c(NA,
-2L))
> dput(df2)
structure(list(total.emissions = structure(2:1, .Label = c("20",
"400"), class = "factor"), emissions.uom = structure(c(1L, 1L
), .Label = "TON", class = "factor"), international = c(FALSE,
TRUE), hours = structure(2:1, .Label = c("1", "8"), class = "factor"),
tribal.name = structure(2:1, .Label = c("SOSD", "WMFJU"), class = "factor"),
state = structure(2:1, .Label = c("SD", "WY"), class = "factor"),
pollutant.desc = structure(1:2, .Label = c("CO2", "SO2"), class = "factor"),
policy = c(FALSE, FALSE)), class = "data.frame", row.names = c(NA,
-2L))
> dput(df3)
structure(list(total_emissions = structure(2:1, .Label = c("200",
"30"), class = "factor"), uom = structure(c(1L, 1L), .Label = "TON", class = "factor"),
boundaries = structure(2:1, .Label = c("N", "Y"), class = "factor"),
tribal_name = structure(2:1, .Label = c("SOSD", "WMFJU"), class = "factor"),
st_usps_cd = structure(2:1, .Label = c("ID", "KS"), class = "factor"),
description = structure(c(1L, 1L), .Label = "SO2", class = "factor"),
policy = c(FALSE, TRUE), time = structure(1:2, .Label = c("17",
"7"), class = "factor")), class = "data.frame", row.names = c(NA,
-2L))
> dput(df4)
structure(list(total_emissions = structure(2:1, .Label = c("700",
"75"), class = "factor"), uom = structure(c(1L, 1L), .Label = "LB", class = "factor"),
tribal_name = structure(1:2, .Label = c("SSJY", "WNCOPS"), class = "factor"),
st_usps_cd = structure(1:2, .Label = c("MO", "NY"), class = "factor"),
pollutant_desc = structure(2:1, .Label = c("CO2", "Methane"
), class = "factor"), boundaries = structure(c(1L, 1L), .Label = "N", class = "factor"),
policy = c(FALSE, FALSE), time = structure(1:2, .Label = c("2",
"3"), class = "factor")), class = "data.frame", row.names = c(NA,
-2L))
Thank you!
Try this:
list_of_frames1 <- list(df1, df2, df3, df4)
list_of_frames2 <- lapply(list_of_frames1, function(x) {
nms <- intersect(names(x), header_map$orignal_col)
names(x)[ match(nms, names(x)) ] <- header_map$new_col[ match(nms, header_map$orignal_col) ]
x
})
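The original error arises because match() returns NA for every mapping entry that is absent from a given dataframe, and NA is not allowed as an index in a subscripted assignment. A minimal sketch of the same fix written the other way round (look up each existing column name in the mapping, then rename only the hits) is shown below; the function name rename_cols and the shortened mapping are illustrative assumptions.

```r
# Shortened mapping, using the same column names as in the question
header_map <- data.frame(
  orignal_col = c("pollutant_desc", "pollutant.desc", "state"),
  new_col     = c("description", "description", "st_usps_cd"),
  stringsAsFactors = FALSE
)

# A frame in which only one column needs renaming
df <- data.frame(pollutant_desc = "CO2", uom = "LB",
                 stringsAsFactors = FALSE)

# This is what caused the error: NA for mapping entries not in the frame
print(match(header_map$orignal_col, names(df)))

# Hypothetical helper: rename only the columns found in the mapping
rename_cols <- function(x, map) {
  idx <- match(names(x), map$orignal_col)  # NA where no rename is needed
  hit <- !is.na(idx)
  names(x)[hit] <- map$new_col[idx[hit]]
  x
}

renamed <- rename_cols(df, header_map)
print(names(renamed))
```

Columns that already carry the target names, or that are not in the mapping at all, are left untouched, so the same call can be used for every year.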

Summarize dataframe with start and end times in R?

Here is a sample of my df:
structure(list(press_id = c(1L, 1L, 2L, 2L, 3L, 3L, 4L, 4L),
start_time = c(164429106370978, 164429106370978, 164429411618824,
164429411618824, 164429837271939, 164429837271939, 164430399454284,
164430399454284), end_time = c(164429182443824, 164429182443824,
164429512525747, 164429512525747, 164429903243169, 164429903243169,
164430465927554, 164430465927554), timestamp = c(164429140697138,
164429175921880, 164429440899844, 164429440899844, 164429867184830,
164429891199391, 164430427558256, 164430433561155), acc_x = c(3.1053743,
2.9904492, 5.889916, 5.889916, 5.808511, 5.36557, 3.545921,
3.4788814), acc_y = c(8.406299, 8.12138, 8.600235, 8.600235,
7.920261, 7.922655, 7.9346266, 7.972935), acc_z = c(4.577853,
4.0894213, 0.35435268, 0.35435268, -0.21309046, 0.46927786,
4.005622, 4.4198313), grav_x = c(3.931084, 4.0214577, 4.7844357,
4.7844357, 5.6572776, 5.65053, 3.9938855, 3.9938855), grav_y = c(8.318872,
8.281514, 8.21449, 8.21449, 7.94851, 7.9495893, 8.027369,
8.027369), grav_z = c(3.393116, 3.3785365, 2.408623, 2.408623,
0.99327636, 1.0226398, 3.9724596, 3.9724596), gyro_x = c(-0.35906965,
0.099690154, 0.06792516, 0.04532315, -0.05546962, -0.06524346,
-0.2967614, -0.32180685), gyro_y = c(0.15843217, -0.48053285,
-0.2196934, -0.21175216, 0.1895863, 0.37467846, 0.12239113,
0.04847643), gyro_z = c(-0.042139318, 0.39585108, 0.12523776,
0.11240959, -0.05863268, 0.042770952, 0.047047008, 0.097137965
), acc_mag = c(10.0630984547559, 9.5719886173707, 10.4297995361418,
10.4297995361418, 9.82419166595324, 9.58008483176486, 9.56958006531909,
9.75731607717771), acc_mag_max = c(10.4656808698978, 10.4656808698978,
10.5978974240054, 10.5978974240054, 10.2717799984467, 10.2717799984467,
10.0054693945119, 10.0054693945119), acc_mag_min = c(9.55048847884876,
9.55048847884876, 9.45791784630329, 9.45791784630329, 9.58008483176486,
9.58008483176486, 9.49389444102469, 9.49389444102469), acc_mag_avg = c(9.9181794947982,
9.9181794947982, 9.82876220923978, 9.82876220923978, 9.89351246166363,
9.89351246166363, 9.77034322149792, 9.77034322149792), vel_ang_mag = c(0.394724572535758,
0.630514095219792, 0.261846355511019, 0.243985821544114,
0.206052505577139, 0.382714007838398, 0.324438496782347,
0.339625377757329), vel_ang_mag_max = c(0.665292823798622,
0.665292823798622, 1.00730683166191, 1.00730683166191, 0.561349818527019,
0.561349818527019, 0.445252333070234, 0.445252333070234),
vel_ang_mag_min = c(0.212944405199931, 0.212944405199931,
0.18680382123856, 0.18680382123856, 0.111795327479332, 0.111795327479332,
0.258342546774667, 0.258342546774667), vel_ang_mag_avg = c(0.440700089033948,
0.440700089033948, 0.405484992593493, 0.405484992593493,
0.284553957549617, 0.284553957549617, 0.348811700631375,
0.348811700631375)), .Names = c("press_id", "start_time",
"end_time", "timestamp", "acc_x", "acc_y", "acc_z", "grav_x",
"grav_y", "grav_z", "gyro_x", "gyro_y", "gyro_z", "acc_mag",
"acc_mag_max", "acc_mag_min", "acc_mag_avg", "vel_ang_mag", "vel_ang_mag_max",
"vel_ang_mag_min", "vel_ang_mag_avg"), row.names = c(NA, -8L), class = c("grouped_df",
"tbl_df", "tbl", "data.frame"), vars = "press_id", drop = TRUE, indices = list(
0:1, 2:3, 4:5, 6:7), group_sizes = c(2L, 2L, 2L, 2L), biggest_group_size = 2L, labels = structure(list(
press_id = 1:4), row.names = c(NA, -4L), class = "data.frame", vars = "press_id", drop = TRUE, indices = list(
0:1, 2:3, 4:5, 6:7), group_sizes = c(2L, 2L, 2L, 2L), biggest_group_size = 2L, labels = structure(list(
press_id = 1:4), row.names = c(NA, -4L), class = "data.frame", vars = "press_id", drop = TRUE, .Names = "press_id"), .Names = "press_id"))
I am trying to summarize it in the following way, where the blank trailing columns are filled with the appropriate values from the dataframe above:
press_id time_state time_state_val acc_mag acc_mag_max acc_mag_min acc_mag_avg vel_ang_mag vel_ang_mag_max vel_ang_mag_min vel_ang_mag_avg
1 start_time 164429106370978
1 end_time 164429182443824
2 start_time 164429411618824
2 end_time 164429512525747
3 start_time 164429837271939
3 end_time 164429903243169
4 start_time 164430399454284
4 end_time 164430427558256
Please advise how I can transform it into the expected result.
I am trying to do this with a combination of tidyr's gather and dplyr, but I don't get the structure I need.
library(dplyr)
library(tidyr)
df1 <- df[,1:6]
df1 %>% mutate(row=row_number()) %>%
gather(time_state , time_state_val, -press_id, -row,-timestamp:-acc_y) %>%
arrange(press_id, row) %>%
select(press_id, time_state, time_state_val, everything(),-row)

Collapse and aggregate several row values by date

I've got a data set that looks like this:
date, location, value, tally, score
2016-06-30T09:30Z, home, foo, 1,
2016-06-30T12:30Z, work, foo, 2,
2016-06-30T19:30Z, home, bar, , 5
I need to aggregate these rows together, to obtain a result such as:
date, location, value, tally, score
2016-06-30, [home, work], [foo, bar], 3, 5
There are several challenges for me:
The resulting row (a daily aggregate) must include all the rows for that day (2016-06-30 in my example above).
Some columns (strings) will result in an array containing all the values present on that day.
Some others (ints) will result in a sum.
I've had a look at dplyr, and if possible I'd like to do this in R.
Thanks for your help!
Edit:
Here's a dput of the data
structure(list(date = structure(1:3, .Label = c("2016-06-30T09:30Z",
"2016-06-30T12:30Z", "2016-06-30T19:30Z"), class = "factor"),
location = structure(c(1L, 2L, 1L), .Label = c("home", "work"
), class = "factor"), value = structure(c(2L, 2L, 1L), .Label = c("bar",
"foo"), class = "factor"), tally = c(1L, 2L, NA), score = c(NA,
NA, 5L)), .Names = c("date", "location", "value", "tally",
"score"), class = "data.frame", row.names = c(NA, -3L))
mydat<-structure(list(date = structure(1:3, .Label = c("2016-06-30T09:30Z",
"2016-06-30T12:30Z", "2016-06-30T19:30Z"), class = "factor"),
location = structure(c(1L, 2L, 1L), .Label = c("home", "work"
), class = "factor"), value = structure(c(2L, 2L, 1L), .Label = c("bar",
"foo"), class = "factor"), tally = c(1L, 2L, NA), score = c(NA,
NA, 5L)), .Names = c("date", "location", "value", "tally",
"score"), class = "data.frame", row.names = c(NA, -3L))
mydat$date <- as.Date(mydat$date)
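library(data.table)
mydat.dt <- data.table(mydat)
# collapse only the string columns; tally and score are summed separately below
mydat.dt <- mydat.dt[, lapply(.SD, paste0, collapse=" "), by = date, .SDcols = c("location", "value")]
cbind(mydat.dt, aggregate(mydat[,c("tally", "score")], by=list(mydat$date), FUN = sum, na.rm=TRUE)[2:3])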
which gives you:
date location value tally score
1: 2016-06-30 home work home foo foo bar 3 5
Note that if you wanted to you could probably do it all in one step in the reshaping of the data.table but I found this to be a quicker and easier way for me to achieve the same thing in 2 steps.
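As a rough illustration of the one-step variant mentioned above, the grouping, string collapsing, and summing can be combined in a single data.table call. This is only a sketch: taking the first 10 characters of the timestamp as the day and keeping unique values per day are assumptions about what the question intends.

```r
library(data.table)

# Sample data from the question, with plain character columns
mydat <- data.frame(
  date     = c("2016-06-30T09:30Z", "2016-06-30T12:30Z", "2016-06-30T19:30Z"),
  location = c("home", "work", "home"),
  value    = c("foo", "foo", "bar"),
  tally    = c(1L, 2L, NA),
  score    = c(NA, NA, 5L),
  stringsAsFactors = FALSE
)

dt <- as.data.table(mydat)

# Assumption: the calendar day is the first 10 characters of the timestamp
dt[, day := as.Date(substr(date, 1, 10))]

# Strings are collapsed to their unique values, integers are summed
res <- dt[, .(location = paste(unique(location), collapse = ", "),
              value    = paste(unique(value), collapse = ", "),
              tally    = sum(tally, na.rm = TRUE),
              score    = sum(score, na.rm = TRUE)),
          by = day]
print(res)
```

Dropping unique() would reproduce the "home work home" style of output shown in the answer.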

R normalize a dataset

I have a dataset that looks like this
> dput(events.seq)
structure(list(vid = structure(1L, .Label = "2a38ebc2-dd97-43c8-9726-59c247854df5", class = "factor"),
deltas = structure(1L, .Label = "38479,38488,38492,38775,45595,45602,45606,45987,50280,50285,50288,50646,54995,55001,55005,55317,59528,59533,59537,59921,63392,63403,63408,63822,66706,66710,66716,67002,73750,73755,73759,74158,77999,78003,78006,78076,81360,81367,81371,82381,93365,93370,93374,93872,154875,154878,154880,154880,155866,155870", class = "factor"),
events = structure(1L, .Label = "mousemove,mousedown,mouseup,click,mousemove,mousedown,mouseup,click,mousemove,mousedown,mouseup,click,mousemove,mousedown,mouseup,click,mousemove,mousedown,mouseup,click,mousemove,mousedown,mouseup,click,mousemove,mousedown,mouseup,click,mousemove,mousedown,mouseup,click,mousemove,mousedown,mouseup,click,mousemove,mousedown,mouseup,click,mousemove,mousedown,mouseup,click,mousemove,mousedown,mouseup,click,mousemove,mousedown", class = "factor")), .Names = c("vid",
"deltas", "events"), class = "data.frame", row.names = c(NA,
-1L))
I need to normalize it to this structure:
> dput(test)
structure(list(vid = structure(c(1L, 1L, 1L), .Label = "2a38ebc2-dd97-43c8-9726-59c247854df5\n+ ", class = "factor"),
delta = c(38479, 38488, 38492), c..mousemove....mousedown....mousup.. = structure(c(2L,
1L, 3L), .Label = c("mousedown", "mousemove", "mousup"), class = "factor")), .Names = c("vid",
"delta", "c..mousemove....mousedown....mousup.."), row.names = c(NA,
-3L), class = "data.frame")
Any help appreciated.
I did try to use strsplit; the problem is that I want to split the second and third columns at the same time (they are always in sync in their length).
Try this:
# split both comma-separated strings and pair the pieces up row by row
z <- with(events.seq, data.frame(
deltas = strsplit(as.character(deltas), split = ",")[[1]],
events = strsplit(as.character(events), ",")[[1]]
))
head(z)
head(z)
The result:
deltas events
1 38479 mousemove
2 38488 mousedown
3 38492 mouseup
4 38775 click
5 45595 mousemove
6 45602 mousedown
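The answer above handles a single row and drops vid. A possible generalisation to several rows that keeps vid and converts the deltas to numbers is sketched below in base R; the two-row sample data and the output column names delta and event are made up for illustration.

```r
# Made-up sample with two rows in the same layout as events.seq
events.seq <- data.frame(
  vid    = c("a", "b"),
  deltas = c("1,2,3", "10,20"),
  events = c("mousemove,mousedown,mouseup", "mousemove,click"),
  stringsAsFactors = FALSE
)

# Split each row's strings; as.character() also covers factor columns
delta_list <- strsplit(as.character(events.seq$deltas), ",", fixed = TRUE)
event_list <- strsplit(as.character(events.seq$events), ",", fixed = TRUE)

# Repeat each vid once per split element, then flatten the lists
long <- data.frame(
  vid   = rep(as.character(events.seq$vid), lengths(delta_list)),
  delta = as.numeric(unlist(delta_list)),
  event = unlist(event_list),
  stringsAsFactors = FALSE
)
print(long)
```

This relies on the deltas and events strings having the same number of elements in every row, as stated in the question.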
