reshaping a large data frame from wide to long in R

I've been through the various reshape questions but don't believe this particular case has been asked before. I am dealing with a data frame of 81K rows and 4188 variables. Columns 161:4188 hold the measurements, one variable per measurement, and the idvar is in column 1. I want to repeat columns 1:160 and create new records for columns 161:4188. The final data frame will have 162 columns and 326,268,000 rows (81K rows * 4028 measurement variables converted to unique records).
Here is what I tried:
reshapeddf <- reshape(c, idvar = "PID",
                      varying = names(c[161:4188]),
                      v.names = "viewership",
                      timevar = "network.show",
                      times = names(c[161:4188]),
                      direction = "long")
The operation didn't complete; I waited almost 10 minutes. Is this the right way? I am on a Windows 7 PC with 8 GB RAM and an i5 at 3.20 GHz. What is the most efficient way to complete this reshape in R? Both of the answers by BondedDust and Nick are clever, but I run into memory issues. Is there a way any of the three approaches in this thread (reshape, tidyr, or do.call) could be implemented using ff?
In the example data below, columns 1:4 are the ones I want to repeat and columns 5:9 are the ones to create new records for.
structure(list(PID = c(1003401L, 1004801L, 1007601L, 1008601L,
1008602L, 1011901L), HHID = c(10034L, 10048L, 10076L, 10086L,
10086L, 10119L), HH.START.DATE = structure(c(1378440000, 1362974400,
1399521600, 1352869200, 1352869200, 1404964800), class = c("POSIXct",
"POSIXt"), tzone = ""), VISITOR.CODE = structure(c(1L, 1L, 1L,
1L, 1L, 1L), .Label = c("0", "L"), class = "factor"), WEIGHTED.MINUTES.VIEWED..ABC...20.20.FRI = c(0,
0, 305892, 0, 101453, 0), WEIGHTED.MINUTES.VIEWED..ABC...BLACK.ISH = c(0,
0, 0, 0, 127281, 0), WEIGHTED.MINUTES.VIEWED..ABC...CASTLE = c(0,
27805, 0, 0, 0, 0), WEIGHTED.MINUTES.VIEWED..ABC...CMA.AWARDS = c(0,
679148, 0, 0, 278460, 498972), WEIGHTED.MINUTES.VIEWED..ABC...COUNTDOWN.TO.CMA.AWARDS = c(0,
316448, 0, 0, 0, 0)), .Names = c("PID", "HHID", "HH.START.DATE",
"VISITOR.CODE", "WEIGHTED.MINUTES.VIEWED..ABC...20.20.FRI", "WEIGHTED.MINUTES.VIEWED..ABC...BLACK.ISH",
"WEIGHTED.MINUTES.VIEWED..ABC...CASTLE", "WEIGHTED.MINUTES.VIEWED..ABC...CMA.AWARDS",
"WEIGHTED.MINUTES.VIEWED..ABC...COUNTDOWN.TO.CMA.AWARDS"), row.names = c(NA,
6L), class = "data.frame")

Might be as easy as something like this:
dat2 <- cbind(dat[1:4], stack(dat[5:length(dat)]))
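stack() names the two new columns values and ind, so you would likely rename them afterwards; a small follow-up, with the column positions assumed from the example data:
names(dat2)[5:6] <- c("viewership", "network.show")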

I think this should work:
library(tidyr)
newdf <- gather(yourdf, program, minutes, -PID:-VISITOR.CODE)
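On the memory question: none of the three approaches discussed in this thread is especially frugal at this scale. As an aside not covered by the original answers, data.table::melt is generally reported to do the same wide-to-long reshape with far less copying; a minimal sketch, assuming the first 160 columns of the full data are the id columns (whether the 326M-row result then fits in 8 GB of RAM is a separate question):
library(data.table)
# melt keeps the 160 id columns and stacks the remaining measurement columns
longdt <- melt(as.data.table(yourdf),
               id.vars = 1:160,
               variable.name = "network.show",
               value.name = "viewership")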

Related

Removing Incorrect Labels within Tidyverse / Limiting Actions of as_factor()

I'm working with British Election Study data. To be used in R, this first has to be converted from the .dta form provided, which I think puts labels onto a lot of variables. Most of the time this is useful, but I think the problem I've got is a case where it isn't.
Using as_factor() blindly converts all variables with labels to factors. Is there a way to specify that only certain vectors are converted? i.e.
new_df <- data %>%
as_factor(just_this_column)
Failing that, is there a good way to remove the labels of certain variables within a data frame? I've looked at the sjlabelled package but this does something weird and converts the data away from a data frame:
example_data <- str(sjlabelled::remove_all_labels(example_data$generalElectionVoteW19))
The reason I'm trying to do all of this is to make a histogram of the number of people voting for each party (the factor) at a certain age. In this dataset, the age variable has a label which is messing up the code.
Of course, I could just convert the factor to a numeric value at the end, but this seems like a messy way of achieving things!
Here is the dput:
structure(list(ageW19 = structure(c(72, 52, 39, 75, 26, 56), label = "Age", format.stata = "%8.0g", labels = c(`Not Asked` = -9,
Skipped = -8), class = c("haven_labelled", "vctrs_vctr", "double"
)), generalElectionVoteW19 = structure(c(1, 13, 3, 1, 2, 1), label = "General election vote intention (recalled vote in post-election waves)", format.stata = "%40.0g", labels = c(`I would/did not vote` = 0,
Conservative = 1, Labour = 2, `Liberal Democrat` = 3, `Scottish National Party (SNP)` = 4,
`Plaid Cymru` = 5, `United Kingdom Independence Party (UKIP)` = 6,
`Green Party` = 7, `British National Party (BNP)` = 8, Other = 9,
`Change UK- The Independent Group` = 11, `Brexit Party` = 12,
`An independent candidate` = 13, `Don't know` = 9999), class = c("haven_labelled",
"vctrs_vctr", "double"))), row.names = c(NA, -6L), class = c("tbl_df",
"tbl", "data.frame"), na.action = c(`1` = 1L, `3` = 3L, `5` = 5L
))
To your first question: you need mutate to convert a single column, e.g.
new_df <- data %>%
  mutate(factor_column = as_factor(old_column))
However, as you said, you probably want a numeric type in the end, so you might want to use as.numeric instead of as_factor.
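For the age variable specifically, a minimal sketch of that idea (assuming the ageW19 column from the dput above; as.numeric drops the haven label attributes):
library(dplyr)
new_df <- data %>%
  mutate(ageW19 = as.numeric(ageW19))   # strips the haven_labelled class, keeps the values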
We may use base R
data$factor_column <- factor(data$old_column)
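If the goal is only to strip labels from particular columns rather than convert them, haven::zap_labels() can also be applied per column inside mutate; a sketch not taken from the answers above:
library(dplyr)
library(haven)
new_df <- data %>%
  mutate(ageW19 = zap_labels(ageW19))   # removes the value labels, leaves the column numeric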

How to properly index list items to return rows, not columns, inside a for loop

I'm trying to write a for loop within another for loop. The outer loop grabs the ith vcov matrix from a list of variously sized matrices (vcmats below) and, from a list of frames (jacobians below), grabs a frame of 24 predictor models of the appropriate dimension to multiply with the current vcov matrix. The inner loop should pull the jth record (row) from the selected predictor frame, format it correctly, run the calculation with the vcov matrix, and write an indicator variable and the calculated result to the holding table (holdtab) for post-processing.
When I run the code below I get the error Error in jjacob[, 1:4] : incorrect number of dimensions, because R is returning the column of 1s (i.e. the intercept column of jacobs), not the complete first record (i.e. jjacob = jacobs[1,]). I've substantially simplified the example but left enough complexity to demonstrate the problem. I would appreciate any help in resolving this issue.
vcmats <- list(structure(c(0.67553, -0.1932, -0.00878, -0.00295, -0.00262,
-0.00637, -0.1932, 0.19988, 0.00331, -0.00159, 0.00149, 2e-05,
-0.00878, 0.00331, 0.00047, -6e-05, 3e-05, 3e-05, -0.00295, -0.00159,
-6e-05, 0.00013, -2e-05, 6e-05, -0.00262, 0.00149, 3e-05, -2e-05,
2e-05, 0, -0.00637, 2e-05, 3e-05, 6e-05, 0, 0.00026), .Dim = c(6L,
6L)), structure(c(0.38399, -0.03572, -0.00543, -0.00453, -0.00634,
-0.03572, 0.10912, 0.00118, -0.00044, 0.00016, -0.00543, 0.00118,
0.00042, -3e-05, 4e-05, -0.00453, -0.00044, -3e-05, 0.00011,
5e-05, -0.00634, 0.00016, 4e-05, 5e-05, 0.00025), .Dim = c(5L,
5L)))
jacobians <- list(structure(list(intcpt = c(1, 1, 1, 1), species = c(1, 1,
0, 0), nage = c(6, 6, 6, 6), T = c(12, 50, 12, 50), hgt = c(90,
90, 90, 90), moon = c(7, 7, 7, 7), hXm = c(0, 0, 0, 0), covr = c(0,
0, 0, 0), het = c(0, 0, 0, 0)), .Names = c("intcpt", "species",
"nage", "T", "hgt", "moon", "hXm", "covr", "het"), row.names = c("1",
"1.4", "1.12", "1.16"), class = "data.frame"), structure(list(
intcpt = c(1, 1, 1, 1), species = c(1, 1, 0, 0), nage = c(6,
6, 6, 6), T = c(12, 50, 12, 50), hgt = c(0, 0, 0, 0), moon = c(7,
7, 7, 7), hXm = c(0, 0, 0, 0), covr = c(0, 0, 0, 0), het = c(0,
0, 0, 0)), .Names = c("intcpt", "species", "nage", "T", "hgt",
"moon", "hXm", "covr", "het"), row.names = c("2", "2.4", "2.12",
"2.16"), class = "data.frame"))
holdtab <- structure(list(model = structure(c(4L, 4L, 4L, 4L, 5L, 5L, 5L,
5L), .Label = c("M.1.BaseCov", "M.2.Height", "M.5.Height.X.LastNewMoon",
"M.6.Height.plus.LastNew", "M.7.LastNewMoon", "M.G.Global"), class = "factor"),
aicc = c(341.317, 341.317, 341.317, 341.317, 342.1412, 342.1412,
342.1412, 342.1412), species = c(NA, NA, NA, NA, NA, NA,
NA, NA), condVar = c(NA, NA, NA, NA, NA, NA, NA, NA)), .Names = c("model",
"aicc", "species", "condVar"), row.names = c(1L, 2L, 3L, 4L,
25L, 26L, 27L, 28L), class = "data.frame")
jloop <- 1
for (imat in vcmats) {                      # Outer loop over the vcov matrices
  jacobs <- jacobians[[jloop]]              # Set tempvar jacobs as the jth member of the jacobians list (n/24)
  for (jjacob in jacobs) {                  # Inner loop over the lines in jacobs (each individual set of predictor levels)
    # I need to reduce the vector length to match my vcov matrix, so
    pt1 <- jjacob[, 1:4]                    # Separate core columns from variable columns (because I don't want to drop species when == 0)
    pt2 <- jjacob[, 5:9]                    # Pull out variable columns for the next step
    pt2 <- pt2[, !apply(pt2 == 0, 2, all)]  # Drop any variable columns that == 0
    jjacob <- cbind(pt1, pt2)               # Reconstruct the record, now of the correct dimensions for the relevant vcov matrix
    jjacob <- as.matrix(jjacob)             # Explicitly convert jjacob - I was having trouble with this previously
    tj <- t(jjacob)                         # Transpose the vector
    condvar <- jjacob %*% imat %*% tj       # Run the calculation
    condVarTab[record, 3] <- jjacob[2]      # Write species 0 or 1 to the output table
    condVarTab[record, 4] <- condvar        # Write the conditional variance to the table
    record <- record + 1                    # Iterate the record number for the next output run
  }
  jloop <- jloop + 1                        # Once all 24 models in a frame are calculated, iterate to the next frame of models, which will be associated with a new vcov matrix
}
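A minimal sketch of the usual fix, looping over row indices so that jjacob is a complete one-row record rather than a column. It reuses the objects above and writes into holdtab (which appears to be what condVarTab was meant to be); it is not a definitive rewrite of the full 24-model workflow:
record <- 1
jloop <- 1
for (imat in vcmats) {
  jacobs <- jacobians[[jloop]]
  for (j in seq_len(nrow(jacobs))) {
    jjacob <- jacobs[j, , drop = FALSE]                    # one full row, all 9 columns
    pt1 <- jjacob[, 1:4]                                   # core columns
    pt2 <- jjacob[, 5:9]                                   # variable columns
    pt2 <- pt2[, !apply(pt2 == 0, 2, all), drop = FALSE]   # drop all-zero variable columns
    jj <- as.matrix(cbind(pt1, pt2))                       # 1 x k, with k matching dim(imat)
    condvar <- jj %*% imat %*% t(jj)                       # 1 x 1 conditional variance
    holdtab[record, 3] <- jj[1, "species"]                 # species indicator
    holdtab[record, 4] <- as.numeric(condvar)              # conditional variance
    record <- record + 1
  }
  jloop <- jloop + 1
}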

Calculating the median of a time series every 8 hours

I am new to R and I have to calculate the mean of a time series containing 5 years of hourly data (ozone, etc.).
My df looks like:
structure(list(date = structure(c(1L, 1L, 1L, 1L), .Label = "01.01.2010", class = "factor"),
day.of = c(1L, 1L, 1L, 1L), time = structure(1:4, .Label = c("00:00",
"01:00", "02:00", "03:00"), class = "factor"), SVF_Ray = c(1L,
1L, 1L, 1L), Gmax = c(0, 0, 0, 0), Ta = c(-1.3, -1.2, -1.2,
-1.2), Tmrt = c(-19.3, -12.1, -12, -12.1), PET = c(-10.4,
-8.7, -8.7, -8.7), PT = c(-11.3, -9.3, -9.3, -9.3), Ozon = c(61.35,
62.65, 63.4, 63.85), rDatum = structure(c(14610, 14610, 14610,
14610), class = "Date"), year = c(2010, 2010, 2010, 2010),
month = c(1, 1, 1, 1), day = c(1, 1, 1, 1), hour = c(0, 1,
2, 3)), .Names = c("date", "day.of", "time", "SVF_Ray", "Gmax",
"Ta", "Tmrt", "PET", "PT", "Ozon", "rDatum", "year", "month",
"day", "hour"), row.names = c(NA, 4L), class = "data.frame")
I would like to calculate the mean of Ozon every 8 hours, so a series of three calculated means for every day. I have arranged my dates like:
Datum_Ozon$rDatum <- as.Date(data$date, format="%d.%m.%Y")
Datum_Ozon$hour<-as.numeric(unlist(strsplit(as.character(df$time), ":"))[seq(1, 2 * length(df$time), 2)])
The format is numeric.
But I don't know how to go further in achieving my goal. Thanks in advance!
If it's the case that your data is regular and complete (i.e., every hour has a record), the following base R code should do the trick:
# Get the number of 8 hour intervals
intervalCnt <- nrow(df) / 8L
# add a grouping vector to your data
df$group <- rep(1:intervalCnt, each=8)
# get the median for each interval, keep year var around for later
intervalMedian <- aggregate(var~group + day + month + year, data=df, FUN=median)
Note that this solution relies on the assumption that the data has a regular structure, i.e., every hour has a record. If the measure of interest has missing values (NA), simply adding na.rm = TRUE to the aggregate() call will still return the statistics of interest:
# get the median for each interval
intervalMedian <- aggregate(var ~ group + day + month + year, data = df, FUN = median, na.rm = TRUE)
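With the OP's column names substituted for the placeholder var, and assuming complete days of hourly records, the call would look something like:
intervalMedian <- aggregate(Ozon ~ group + day + month + year, data = df,
                            FUN = median, na.rm = TRUE)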
If you have a variable for hour of the day, here is a simple way to check for data regularity:
table(df$hourOfDay)
The result is a frequency count for each hour; the counts should all be equal. Another thing to check is that the series covers whole days, i.e. if the hour of the first observation is "00:00", then the hour of the final observation should be "23:00".
To provide a plot of the mean of the 8-hour medians by year, you can again use aggregate:
intervalMeans.year <- aggregate(var ~ year, data = intervalMedian,
                                FUN = mean, na.rm = TRUE)
The inclusion of the group, day, month, and year variables in the intervalMedian data.frame allows for a lot of different aggregations. For example, with a minor adjustment, it is possible to get the average value of a variable over the 5-year period for each time period-day-month:
intervalMedian$periodDay <- rep(1:3, length.out = nrow(intervalMedian))
intervalMeans.dayMonthPeriod <- aggregate(var ~ periodDay + day + month,
                                          data = intervalMedian, FUN = mean, na.rm = TRUE)
Here is a basic example using a dplyr pipe (rather than a plyr approach) together with ifelse(). Everything is self-contained here:
library(dplyr)
## OP data
df <-
structure(list(date = structure(c(1L, 1L, 1L, 1L), .Label = "01.01.2010", class = "factor"),
day.of = c(1L, 1L, 1L, 1L), time = structure(1:4, .Label = c("00:00",
"01:00", "02:00", "03:00"), class = "factor"), SVF_Ray = c(1L,
1L, 1L, 1L), Gmax = c(0, 0, 0, 0), Ta = c(-1.3, -1.2, -1.2,
-1.2), Tmrt = c(-19.3, -12.1, -12, -12.1), PET = c(-10.4,
-8.7, -8.7, -8.7), PT = c(-11.3, -9.3, -9.3, -9.3), Ozon = c(61.35,
62.65, 63.4, 63.85), rDatum = structure(c(14610, 14610, 14610,
14610), class = "Date"), year = c(2010, 2010, 2010, 2010),
month = c(1, 1, 1, 1), day = c(1, 1, 1, 1), hour = c(0, 1,
2, 3)), .Names = c("date", "day.of", "time", "SVF_Ray", "Gmax",
"Ta", "Tmrt", "PET", "PT", "Ozon", "rDatum", "year", "month",
"day", "hour"), row.names = c(NA, 4L), class = "data.frame")
df %>%
  mutate(DayChunk = ifelse(hour %in% 0:7, "FirstThird",
                    ifelse(hour %in% 8:15, "SecondThird",
                           "ThirdThird"))) %>%
  group_by(date, DayChunk) %>%
  summarise(MedOzon = median(Ozon))
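A possible alternative to the nested ifelse() calls is cut() on the hour column, which scales more cleanly if the day is ever split into more chunks; a sketch under the same assumptions:
df %>%
  mutate(DayChunk = cut(hour, breaks = c(-Inf, 7, 15, Inf),
                        labels = c("FirstThird", "SecondThird", "ThirdThird"))) %>%
  group_by(date, DayChunk) %>%
  summarise(MedOzon = median(Ozon))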
Look up the function seq.POSIXt; it has options to specify the start and end of the sequence and is designed to create sequences of times. For your problem:
myseq <- seq(ISOdate(2010, 01, 01, 00, 00, 00, tz = "GMT"), to = ISOdate(2016, 01, 05), by = "8 hour")
Use ISOdate() to set the start and stop times. If you are going to be working much with times, I suggest reading up on strptime() and the POSIXlt/POSIXct time classes.
Now, with the breaks defined, and assuming you have a column in your data frame (Datum_Ozon) named "datetime", use cut() to group your data:
Datum_Ozon$datetime <- as.POSIXct(paste(as.character(Datum_Ozon$date),
                                        as.character(Datum_Ozon$time)),
                                  format = "%d.%m.%Y %H:%M", tz = "GMT")
library(dplyr)
summarize(group_by(Datum_Ozon, cut(Datum_Ozon$datetime, myseq)), mean(Ozon))
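For reference, a base R sketch of the same grouping, assuming the datetime column built above:
Datum_Ozon$interval <- cut(Datum_Ozon$datetime, breaks = myseq)
aggregate(Ozon ~ interval, data = Datum_Ozon, FUN = mean)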

Improving a for loop using other methods

Problem
There is one main station (df) and there are three local stations (s), stacked in a single data.frame, with values for three days. The idea is to take each day from the main station, find the relative anomaly of the three local stations, and smooth it using inverse distance weighting (IDW) from the phylin package. This is then applied to the value in the main station by multiplication.
Any suggestions on improving this code (e.g. data.table, dplyr, apply)? I still don't know how to approach this problem without the cumbersome for loop.
dput
s <- structure(list(id = c("USC00031152", "USC00034638", "USC00036352",
"USC00031152", "USC00034638", "USC00036352", "USC00031152", "USC00034638",
"USC00036352"), lat = c(33.59, 34.7392, 35.2833, 33.59, 34.7392,
35.2833, 33.59, 34.7392, 35.2833), long = c(-92.8236, -90.7664,
-93.1, -92.8236, -90.7664, -93.1, -92.8236, -90.7664, -93.1),
year = c(1900, 1900, 1900, 1900, 1900, 1900, 1900, 1900,
1900), month = c(1, 1, 1, 1, 1, 1, 1, 1, 1), day = c(1L,
1L, 1L, 2L, 2L, 2L, 3L, 3L, 3L), value = c(63.3157576809045,
86.0490598902219, 76.506386949066, 71.3760752788486, 89.9119576975542,
76.3535163951321, 53.7259645981243, 61.7989638892985, 85.8911224149051
)), class = c("tbl_df", "tbl", "data.frame"), row.names = c(NA,
-9L), .Names = c("id", "lat", "long", "year", "month", "day",
"value"))
df <- structure(list(id = c(12345, 12345, 12345), lat = c(100, 100,
100), long = c(50, 50, 50), year = c(1900, 1900, 1900), month = c(1,
1, 1), day = 1:3, value = c(54.8780020601509, 106.966029162171,
98.3198828955801)), row.names = c(NA, -3L), class = "data.frame", .Names = c("id",
"lat", "long", "year", "month", "day", "value"))
Code
library(phylin)
nearest <- function(i, loc){
  # Stack 3 local stations
  stack <- s[loc:(loc + 2), ]
  # Get 1 main station
  station <- df[i, ]
  # Check for NA and build relative anomaly (r)
  stack <- stack[!is.na(stack$value), ]
  stack$r <- stack$value / station$value
  # Use IDW and return v
  v <- as.numeric(ifelse(dim(stack)[1] == 1,
                         stack$r,
                         idw(stack$r, stack[, c(2, 3, 8)], station[, 2:3])))
  return(v)
}

ncdc <- 1
for (i in 1:nrow(df)){
  # Get relative anomaly from function
  r <- nearest(i, ncdc)
  # Get value from main station and apply anomaly
  p <- df[i, 7]
  df[i, 7] <- p * r
  # Iterate to next 3 local stations
  ncdc <- ncdc + 3
}
Suppose you leave your nearest function unchanged.
Then you could get your new value column in df with:
newvalue <- sapply(1:NROW(df), function(i) df[i, 7] * nearest(i, 3 * (i - 1) + 1))
df$value <- newvalue
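Equivalent, with the local-station offsets spelled out as a sequence rather than the 3 * (i - 1) + 1 arithmetic; a sketch under the same assumptions as the answer above:
offsets  <- seq(1, by = 3, length.out = nrow(df))
df$value <- mapply(function(i, loc) df$value[i] * nearest(i, loc),
                   seq_len(nrow(df)), offsets)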

Get a unique median value out of a list of several objects

I have this list of 3 objects which contain biomass values. I want to extract one unique value out of this list which corresponds to the median of all the forest biomass values inside the 3 objects. I think it is pretty straightforward but somehow I could not manage to get there. Can someone help me out with that? Thanks for your help.
dput(x)
list(structure(c(3.37515461444855, 5.19044327735901, 3.22319519519806,
5.68365132808685, 2.36871695518494, 2.36871695518494, 3.63608360290527,
2.99963092803955, 10.2748856544495, 10.2748856544495, 16.4309034347534,
22.3492307662964, 12.4613256454468, 0, 2.03538191318512, 1.07113289833069,
21.3975343704224, 15.1670708656311, 4.22249209880829, 7.37385129928589,
14.4166820049286, 14.3547036647797, 0, 7.37385129928589, 7.05217242240906,
7.05217242240906, 3.68692564964294, 7.05217242240906, 6.73049354553223,
7.05217242240906, 3.67388677597046, 25.9837236404419, 45.9836235046387,
33.4825744628906, 1.5435653924942, 10.1114643216133, 45.6102886199951,
10.1114643216133, 31.958158493042, 45.2369537353516, 45.2369537353516,
18.6793632507324, 18.6793632507324, 21.7280540466309, 19.710410118103
), .Dim = c(45L, 1L), .Dimnames = list(NULL, "Forest_Biomass_2000")),
structure(c(14.4797344207764, 2.04780006408691, 0, 0, 13.7020168304443,
0, 0, 0.32373720407486, 22.9602508544922, 11.6327629089355,
0, 4.97857093811035, 5.25019407272339), .Dim = c(13L, 1L), .Dimnames = list(
NULL, "Forest_Biomass_2000")), structure(NA_real_, .Dim = c(1L,
1L), .Dimnames = list(NULL, "Forest_Biomass_2000")))
How about
median(unlist(x), na.rm = TRUE)
#[1] 7.052172
Edit after comment
y <- unlist(x)
q <- quantile(y, na.rm = TRUE)
median(y[y > q[4]], na.rm = TRUE)
#[1] 22.96025
Is that what you mean?
