I have a data frame that has annual data for population by MSA. They are organized as follows:
MSA FIPS x1969 x1970 x1971 .... x2012
Akron 123 12 14 17 .... 22
Miami 234 23 20 24 .... 29
etc.
I need to reshape the data into
MSA FIPS Year Data
Akron 123 1969 12
Akron 123 1970 14
Akron 123 1971 17
...
I can do this using "melt", but I also want to interpolate these annual data to include quarterly data points for the full time series. So, how best to create the quarterly (interpolated) matrix on the fly?
I can do this using a loop over the rows of the first matrix above and then use melt to reshape the new data, but I've been asked to slap myself anytime I catch myself building explicitly coded loops.
I've been tinkering with "apply", but it creates a list of lists -- which would then require assembling the final data frame.
I can feel a simple solution must be out there.
Thanks, Chris.
Maybe you could try td from the tempdisagg package:
library(tempdisagg)
library(reshape2)
library(zoo)
dM <- transform(melt(df, id.vars=c('MSA', 'FIPS')),
                variable=as.numeric(gsub('^x', '', variable)))
res <- lapply(split(dM, dM$MSA), function(x) {
  val <- ts(x$value, start=x$variable[1], end=x$variable[nrow(x)])
  # change the options as needed
  val2 <- predict(td(val~1, to='quarterly', method='uniform'))
  data.frame(yearQtr=as.yearqtr(time(val2)), val=val2)})
data
df <- structure(list(MSA = c("Akron", "Miami"), FIPS = c(123L, 234L
), x1969 = c(12L, 23L), x1970 = c(14L, 20L), x1971 = c(17L, 24L
)), .Names = c("MSA", "FIPS", "x1969", "x1970", "x1971"), class = "data.frame",
row.names = c(NA, -2L))
This builds on @akrun's earlier answer:
#His data frame build:
df <- structure(list(MSA = c("Akron", "Miami"), FIPS = c(123L, 234L),
x1969 = c(12L, 23L), x1970 = c(14L, 20L), x1971 = c(17L, 24L)),
.Names = c("MSA", "FIPS", "x1969", "x1970", "x1971"), class = "data.frame",
row.names = c(NA, -2L))
#His set up:
dM <- transform(melt(df, id.var=c('MSA', 'FIPS')),
variable=as.numeric(gsub('^x', '', variable)))
#My variation on his lapply:
res <- lapply(split(dM, dM$MSA), function(x) {
  xseq <- seq(min(x$variable), max(x$variable), by=.25)
  val <- approx(x$variable, x$value, xout=xseq)
  data.frame(yearQtr=xseq, val=val$y)})
df.new <- do.call(rbind.data.frame, res)
It's not quite perfect, but I'll get back to it later. We're close. Thank you, @akrun.
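One way to finish this off, offered as a rough sketch and not a definitive answer: interpolate each row with approx() and carry MSA and FIPS along, so the result already has the long layout asked for in the question. It uses base R only. The names year_cols, years, xout and quarterly are made up for illustration, and linear interpolation is an assumption about what suits the data.

```r
# Sample data from the answers above
df <- data.frame(MSA = c("Akron", "Miami"), FIPS = c(123L, 234L),
                 x1969 = c(12L, 23L), x1970 = c(14L, 20L), x1971 = c(17L, 24L))

# Year columns and their numeric years (assumes names of the form x1969)
year_cols <- grep("^x[0-9]{4}$", names(df), value = TRUE)
years <- as.numeric(sub("^x", "", year_cols))

# Quarterly grid covering the full annual range
xout <- seq(min(years), max(years), by = 0.25)

# Linear interpolation per row, stacked into one long data frame
quarterly <- do.call(rbind, lapply(seq_len(nrow(df)), function(i) {
  y <- approx(years, unlist(df[i, year_cols]), xout = xout)$y
  data.frame(MSA = df$MSA[i], FIPS = df$FIPS[i], Year = xout, Data = y)
}))

head(quarterly)
```

Each MSA ends up with one row per quarter, with Year as a decimal (1969.25 is the second quarter of 1969).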
I have two dataframes in R, recurrent and L1HS. I am trying to find a way to do this:
If a sequence in recurrent matches sequence in L1HS, paste a value from a column in recurrent into new column in L1HS.
The recurrent dataframe looks like this:
> head(recurrent)
chr start end X Y level unique
1: chr4 56707846 56708347 0 38 03 chr4_56707846_56708347
2: chr1 20252181 20252682 0 37 03 chr1_20252181_20252682
3: chr2 224560903 224561404 0 37 03 chr2_224560903_224561404
4: chr5 131849595 131850096 0 36 03 chr5_131849595_131850096
5: chr7 46361610 46362111 0 36 03 chr7_46361610_46362111
6: chr1 20251169 20251670 0 36 03 chr1_20251169_20251670
The L1HS dataset contains many columns containing genetic sequence basepairs and a column "Sequence" that should hopefully have some matches with "unique" in the recurrent data frame, like so:
> head(L1HS$Sequence)
"chr1_35031657_35037706"
"chr1_67544575_67550598"
"chr1_81404889_81410942"
"chr1_84518073_84524089"
"chr1_87144764_87150794"
I know how to search for matches using
test <- recurrent$unique %in% L1HS$Sequence
to get the Booleans:
> head(test)
[1] FALSE FALSE FALSE FALSE FALSE FALSE
But I have a couple of problems from here. If the sequence is found, I want to copy the "level" value from the recurrent dataset to the L1HS dataset in a new column. For example, if the sequence "chr4_56707846_56708347" from the recurrent data was found in the full-length data, I'd like the full-length data frame to look like:
Sequence level other_columns
chr4_56707846_56708347 03 gggtttcatgaccc....
I was thinking of trying something like:
for (i in L1HS){
if (recurrent$unique %in% L1HS$Sequence{
L1HS$level <- paste(recurrent$level[i])}
}
but of course this isn't working and I can't figure it out.
I am wondering what the best approach is here! I'm wondering if merge/intersect/apply might be easier/better, or just what best practice might look like for a somewhat simple question like this. I've found some similar examples for Python/pandas, but am stuck here.
Thanks in advance!
You can do a simple left_join to add level to L1HS with dplyr.
library(dplyr)
L1HS %>%
left_join(., recurrent %>% select(unique, level), by = c("Sequence" = "unique"))
Or with merge:
merge(x=L1HS,y=recurrent[, c("unique", "level")], by.x = "Sequence", by.y = "unique",all.x=TRUE)
Output
Sequence level
1 chr1_35031657_35037706 4
2 chr1_67544575_67550598 2
3 chr1_81404889_81410942 NA
4 chr1_84518073_84524089 3
5 chr1_87144764_87150794 NA
*Note: This will still retain all the columns in L1HS. I just didn't create any additional columns in the example data below.
Data
recurrent <- structure(list(chr = c("chr4", "chr1", "chr2", "chr5", "chr7",
"chr1"), start = c(56707846L, 20252181L, 224560903L, 131849595L,
46361610L, 20251169L), end = c(56708347L, 20252682L, 224561404L,
131850096L, 46362111L, 20251670L), X = c(0L, 0L, 0L, 0L, 0L,
0L), Y = c(38L, 37L, 37L, 36L, 36L, 36L), level = c(3L, 2L, 3L,
3L, 3L, 4L), unique = c("chr4_56707846_56708347", "chr1_67544575_67550598",
"chr2_224560903_224561404", "chr5_131849595_131850096", "chr1_84518073_84524089",
"chr1_35031657_35037706")), class = "data.frame", row.names = c(NA,
-6L))
L1HS <- structure(list(Sequence = c("chr1_35031657_35037706", "chr1_67544575_67550598",
"chr1_81404889_81410942", "chr1_84518073_84524089", "chr1_87144764_87150794"
)), class = "data.frame", row.names = c(NA, -5L))
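If you would rather avoid a join altogether, a base R lookup with match() should give the same column. This is only a sketch on a trimmed copy of the example data above (two columns of recurrent instead of seven), and idx is a name made up for illustration. It assumes each value of unique appears at most once in recurrent; if there are duplicates, match() silently takes the first one.

```r
# Trimmed copy of the example data above
recurrent <- data.frame(
  unique = c("chr4_56707846_56708347", "chr1_67544575_67550598",
             "chr2_224560903_224561404", "chr5_131849595_131850096",
             "chr1_84518073_84524089", "chr1_35031657_35037706"),
  level = c(3L, 2L, 3L, 3L, 3L, 4L)
)
L1HS <- data.frame(
  Sequence = c("chr1_35031657_35037706", "chr1_67544575_67550598",
               "chr1_81404889_81410942", "chr1_84518073_84524089",
               "chr1_87144764_87150794")
)

# Position of each L1HS sequence in recurrent (NA where there is no match)
idx <- match(L1HS$Sequence, recurrent$unique)

# Indexing with NA yields NA, so unmatched rows get a missing level
L1HS$level <- recurrent$level[idx]
L1HS
```

Unlike merge, this keeps the rows of L1HS in their original order.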
I have a list that's 1314 elements long. Each element is a data frame consisting of two rows and four columns.
Game.ID Team Points Victory
1 201210300CLE CLE 94 0
2 201210300CLE WAS 84 0
I would like to use the lapply function to compare points for each team in each game, and change Victory to 1 for the winning team.
I'm trying to use this function:
test_vic <- lapply(all_games, function(x) {if (x[1,3] > x[2,3]) {x[1,4] = 1}})
But the result it produces is a list 1314 elements long with just the Game ID and either a 1 or a null, a la:
$`201306200MIA`
[1] 1
$`201306160SAS`
NULL
How can I fix my code so that each data frame maintains its shape? (I'm guessing solving the null part involves if-else, but I need to figure out the right syntax.)
Thanks.
Try
lapply(all_games, function(x) {x$Victory[which.max(x$Points)] <- 1; x})
Or another option would be to convert the list to data.table by using rbindlist and then do the conversion
library(data.table)
rbindlist(all_games)[,Victory:= +(Points==max(Points)) ,Game.ID][]
data
all_games <- list(structure(list(Game.ID = c("201210300CLE",
"201210300CLE"
), Team = c("CLE", "WAS"), Points = c(94L, 84L), Victory = c(0L,
0L)), .Names = c("Game.ID", "Team", "Points", "Victory"),
class = "data.frame", row.names = c("1",
"2")), structure(list(Game.ID = c("201210300CME", "201210300CME"
), Team = c("CLE", "WAS"), Points = c(90, 92), Victory = c(0L,
0L)), .Names = c("Game.ID", "Team", "Points", "Victory"),
row.names = c("1", "2"), class = "data.frame"))
You could try dplyr:
library(dplyr)
all_games %>%
bind_rows() %>%
group_by(Game.ID) %>%
mutate(Victory = row_number(Points)-1)
Which gives:
#Source: local data frame [4 x 4]
#Groups: Game.ID
#
# Game.ID Team Points Victory
#1 201210300CLE CLE 94 1
#2 201210300CLE WAS 84 0
#3 201210300CME CLE 90 0
#4 201210300CME WAS 92 1
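As for why the original attempt returned a 1 or NULL: an R function returns the value of its last expression, and there that was the if() itself, which evaluates to the assigned value (1) when the condition holds and to NULL otherwise. A minimal sketch of the same idea with the data frame returned explicitly, on simplified data modelled on the example above; leaving both rows at 0 for a tie is an assumption, not something the question specifies.

```r
# Two small games modelled on the example data
all_games <- list(
  data.frame(Game.ID = "201210300CLE", Team = c("CLE", "WAS"),
             Points = c(94, 84), Victory = 0),
  data.frame(Game.ID = "201210300CME", Team = c("CLE", "WAS"),
             Points = c(90, 92), Victory = 0)
)

test_vic <- lapply(all_games, function(x) {
  if (x[1, 3] > x[2, 3]) {
    x[1, 4] <- 1
  } else if (x[2, 3] > x[1, 3]) {
    x[2, 4] <- 1
  }
  x  # return the whole data frame, not the value of the if()
})

test_vic[[2]]
```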
I have a 2 dimensional data set (matrix/data frame) that looks like this
779 482 859 1156
maxs 56916.00 78968.00 51156.00 44827.01
Means+Stdv 41784.70 64440.83 38319.10 42767.14
Mean_Cost 31863.18 44407.40 29365.78 38711.29
Means_Stdv 21941.66 24373.97 20412.45 34655.43
mins 21088.00 13768.00 24132.00 31452.00
The 779, 482, 859, 1156 are values that I want to draw on the x-axis.
The rest of the values in each column are the values that correspond to each x.
Now I want to plot the entire data set, so that I have a graph with the following points
(779,56916) , (779, 41784)......
(482,78968) , (482, 64440)..... and so on
The way I did it so far is like this (it gives me the plot I am looking for)
plot(colnames(resultsSummary),resultsSummary[1,],ylim=c(0,80000),pch=6)
points(colnames(resultsSummary),resultsSummary[2,],pch=3)
points(colnames(resultsSummary),resultsSummary[3,])
and so on..... plotting row by row
I am sure there is a better way to do it, but I don't know how. Any suggestions?
DF <- read.table(text=" 779 482 859 1156
maxs 56916.00 78968.00 51156.00 44827.01
Means+Stdv 41784.70 64440.83 38319.10 42767.14
Mean_Cost 31863.18 44407.40 29365.78 38711.29
Means_Stdv 21941.66 24373.97 20412.45 34655.43
mins 21088.00 13768.00 24132.00 31452.00",
header=TRUE, check.names=FALSE)
m <- as.matrix(DF)
# t(m) has one column per row of m, so use one plotting symbol per row
matplot(as.integer(colnames(m)),
t(m), pch=seq_len(nrow(m)))
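If you want the (x, y) pairs from the question as an explicit table, for example to hand to another plotting function, the matrix can be unrolled in base R. This is a sketch only; long is a made-up name, and the matrix is retyped here so the snippet runs on its own.

```r
# Same numbers as above, entered column by column
m <- matrix(c(56916.00, 41784.70, 31863.18, 21941.66, 21088.00,
              78968.00, 64440.83, 44407.40, 24373.97, 13768.00,
              51156.00, 38319.10, 29365.78, 20412.45, 24132.00,
              44827.01, 42767.14, 38711.29, 34655.43, 31452.00),
            nrow = 5,
            dimnames = list(c("maxs", "Means+Stdv", "Mean_Cost", "Means_Stdv", "mins"),
                            c("779", "482", "859", "1156")))

# as.vector() walks the matrix column by column, so repeat each
# column name nrow(m) times and cycle through the row names
long <- data.frame(
  x = as.integer(rep(colnames(m), each = nrow(m))),
  series = rep(rownames(m), times = ncol(m)),
  y = as.vector(m)
)

head(long)
```

With that in hand, plot(long$x, long$y, pch = as.integer(factor(long$series))) draws every point in one call.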
The following also works:
ddf = structure(list(var = structure(c(1L, 4L, 2L, 3L, 5L), .Label = c("maxs",
"Mean_Cost", "Means_Stdv", "Means+Stdv", "mins"), class = "factor"),
X779 = c(56916, 41784.7, 31863.18, 21941.66, 21088), X482 = c(78968,
64440.83, 44407.4, 24373.97, 13768), X859 = c(51156, 38319.1,
29365.78, 20412.45, 24132), X1156 = c(44827.01, 42767.14,
38711.29, 34655.43, 31452)), .Names = c("var", "X779", "X482",
"X859", "X1156"), class = "data.frame", row.names = c(NA, -5L
))
ddf
var X779 X482 X859 X1156
1 maxs 56916.00 78968.00 51156.00 44827.01
2 Means+Stdv 41784.70 64440.83 38319.10 42767.14
3 Mean_Cost 31863.18 44407.40 29365.78 38711.29
4 Means_Stdv 21941.66 24373.97 20412.45 34655.43
5 mins 21088.00 13768.00 24132.00 31452.00
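library(reshape2)
library(ggplot2)
# add the x values as a sixth row; strip the leading "X" (substr(., 2, 4) would cut 1156 to 115)
ddf[6,2:5]=as.numeric(sub('^X', '', names(ddf)[2:5]))
# drop the label column before transposing so the values stay numeric
ddf2 = data.frame(t(ddf[,-1]))
mm = melt(ddf2, id='X6')
ggplot(mm)+geom_point(aes(x=X6, y=value, color=variable))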
I imagine that there's some way to do this with sqldf, though I'm not familiar with the syntax of that package enough to get this to work. Here's the issue:
I have two data frames, each of which describe genomic regions and contain some other data. I have to combine the two if the region described in the one df falls within the region of the other df.
One df, g, looks like this (though my real data has other columns)
start_position end_position
1 22926178 22928035
2 22887317 22889471
3 22876403 22884442
4 22862447 22866319
5 22822490 22827551
And another, l, looks like this (this sample has a named column)
name start end
101 GRMZM2G001024 11149187 11511198
589 GRMZM2G575546 24382534 24860958
7859 GRMZM2G441511 22762447 23762447
658 AC184765.4_FG005 26282236 26682919
14 GRMZM2G396835 10009264 10402790
I need to merge the two dataframes if the values from the start_position OR end_position columns in g fall within the start-end range in l, returning only the columns in l that have a match. I've been trying to get findInterval() to do the job, but haven't been able to return a merged DF. Any ideas?
My data:
g <- structure(list(start_position = c(22926178L, 22887317L, 22876403L,
22862447L, 22822490L), end_position = c(22928035L, 22889471L,
22884442L, 22866319L, 22827551L)), .Names = c("start_position",
"end_position"), row.names = c(NA, 5L), class = "data.frame")
l <- structure(list(name = structure(c(2L, 12L, 9L, 1L, 8L), .Label = c("AC184765.4_FG005",
"GRMZM2G001024", "GRMZM2G058655", "GRMZM2G072028", "GRMZM2G157132",
"GRMZM2G160834", "GRMZM2G166507", "GRMZM2G396835", "GRMZM2G441511",
"GRMZM2G442645", "GRMZM2G572807", "GRMZM2G575546", "GRMZM2G702094"
), class = "factor"), start = c(11149187L, 24382534L, 22762447L,
26282236L, 10009264L), end = c(11511198L, 24860958L, 23762447L,
26682919L, 10402790L)), .Names = c("name", "start", "end"), row.names = c(101L,
589L, 7859L, 658L, 14L), class = "data.frame")
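One possible approach, sketched in base R only and not taken from the thread: build every (g, l) pair with a cross join, then keep the pairs where either endpoint of the g region lies inside the l region. The names pairs, start_inside, end_inside and hits are made up for illustration, and the data is retyped (with name as plain character) so the snippet runs on its own.

```r
g <- data.frame(
  start_position = c(22926178L, 22887317L, 22876403L, 22862447L, 22822490L),
  end_position   = c(22928035L, 22889471L, 22884442L, 22866319L, 22827551L)
)
l <- data.frame(
  name  = c("GRMZM2G001024", "GRMZM2G575546", "GRMZM2G441511",
            "AC184765.4_FG005", "GRMZM2G396835"),
  start = c(11149187L, 24382534L, 22762447L, 26282236L, 10009264L),
  end   = c(11511198L, 24860958L, 23762447L, 26682919L, 10402790L)
)

# by = NULL makes merge() return the Cartesian product of the two frames
pairs <- merge(g, l, by = NULL)

# TRUE where the g start (or g end) falls within [start, end] of l
start_inside <- pairs$start_position >= pairs$start & pairs$start_position <= pairs$end
end_inside   <- pairs$end_position   >= pairs$start & pairs$end_position   <= pairs$end

hits <- pairs[start_inside | end_inside, ]
hits
```

The cross join grows as nrow(g) * nrow(l), so this only suits modest sizes. For genome-scale data, an interval-join tool such as foverlaps() in data.table or findOverlaps() in GenomicRanges is probably the better fit.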
I am new to R and I need to conduct a time-series, cross-sectional (TSCS) analysis in R (dynamic probit). I know how to run the model, but I need to tell R that I am dealing with TSCS data.
I have data for 44 countries (countries are both coded numerically and in character form in the data set), and for 52 years for each of these. E.g:
Angola 1950
Angola 1951
.
.
.
Benin 1950
Benin 1951
I have found the ts() command, but I am not sure if I have used it correctly. My code so far is:
outdata50time <- ts(data=outdata50, start=1950, end=2002)
Will that do the trick? Or do I need different classes for the countries?
Thanks for your help!
Load the data set (I added some data points to the data set in the question):
library(data.table)
test <- data.table(structure(list(Country = structure(c(1L, 1L, 2L, 2L), .Label = c("Angola",
"Benin"), class = "factor"), Year = c(1950L, 1951L, 1950L, 1951L
), Data = c(23L, 24L, 45L, 64L)), .Names = c("Country", "Year",
"Data"), class = "data.frame", row.names = c(NA, -4L)))
Once you have this, I would loop over the countries to extract the data for each one. An example for a single country:
benin_ts <- ts(test[Country=="Benin"]$Data, start=1950, frequency=1)
benin_ts
Time Series:
Start = 1950
End = 1951
Frequency = 1
[1] 45 64
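The per-country step can also be written without an explicit loop. This is a sketch with a plain data frame, so data.table is not required; series is a made-up name, and it assumes every country has one observation per year with no gaps. ts() itself has no notion of a country dimension, which is why each country gets its own series here.

```r
# Same toy data as above, as a plain data frame
test <- data.frame(
  Country = c("Angola", "Angola", "Benin", "Benin"),
  Year    = c(1950L, 1951L, 1950L, 1951L),
  Data    = c(23L, 24L, 45L, 64L)
)

# One annual ts object per country, returned as a named list
series <- lapply(split(test, test$Country), function(d) {
  d <- d[order(d$Year), ]                       # make sure years are ascending
  ts(d$Data, start = d$Year[1], frequency = 1)  # annual data
})

series$Benin
```

Individual series are then available by name, for example series$Angola.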