Tidy Data Layout - convert variables into factors - r

I have the following data table:
| State | Prod. |Non-Prod.|
|-------|-------|---------|
| CA | 120 | 23 |
| GA | 123 | 34 |
| TX | 290 | 34 |
How can I convert this table to tidy data format in R or any other software like Excel?
|State | Class | # of EEs|
|------|----------|---------|
| CA | Prod. | 120 |
| CA | Non-Prod.| 23 |
| GA | Prod. | 123 |
| GA | Non-Prod.| 34 |
| TX | Prod. | 290 |
| TX | Non-Prod.| 34 |

Try using reshape2:
library(reshape2)
melt(df, id.vars = 'State')
#   State  variable value
# 1    CA     Prod.   120
# 2    GA     Prod.   123
# 3    TX     Prod.   290
# 4    CA Non-Prod.    23
# 5    GA Non-Prod.    34
# 6    TX Non-Prod.    34
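An alternative, if you prefer the tidyverse: tidyr's pivot_longer does the same reshape. A minimal sketch against the table above (the data.frame construction is my own, matching the headers shown):
library(tidyr)
df <- data.frame(State = c("CA", "GA", "TX"),
                 `Prod.` = c(120, 123, 290),
                 `Non-Prod.` = c(23, 34, 34),
                 check.names = FALSE)
# pivot every column except State into Class/value pairs
pivot_longer(df, -State, names_to = "Class", values_to = "EEs")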

Related

How to join dataframes by ID column? [duplicate]

This question already has answers here:
Simultaneously merge multiple data.frames in a list
(9 answers)
Closed 1 year ago.
I have 3 dataframes like the ones below, with IDs that may not necessarily occur in all of them.
DF1:
ID | Name | Phone# | Country | State | Amount_month1
0210 | John K. | 8942829725 | USA | PA | 1300
0215 | Peter | 8711234566 | USA | KS | 50
2312 | Steven | 9012341221 | USA | TX | 1000
0005 | Haris | 9167456363 | USA | NY | 1200
DF2:
ID | Name | Phone# | Country | State | Amount_month2
0210 | John K. | 8942829725 | USA | PA | 200
2312 | Steven | 9012341221 | USA | TX | 350
2112 | Jerry | 9817273794 | USA | CA | 100
DF3:
ID | Name | Phone# | Country | State | Amount_month3
0210 | John K. | 8942829725 | USA | PA | 300
0005 | Haris | 9167456363 | USA | NY | 1250
1212 | Jerry | 9817273794 | USA | CA | 1200
1210 | Drew | 8012341234 | USA | TX | 1400
I would like to join these 3 dataframes by ID and add the varying amount columns as separate columns; the missing amount values can be either 0 or NA, such as:
ID | Name | Phone# | Country |State| Amount_month1 | Amount_month2 | Amount_month3
0210 | John K. | 8942829725 | USA | PA | 1300 | 200 | 300
0215 | Peter | 8711234566 | USA | KS | 50 | 0 | 0
2312 | Steven | 9012341221 | USA | TX | 1000 | 350 | 0
0005 | Haris | 9167456363 | USA | NY | 1200 | 0 | 1250
1212 | Jerry | 9817273794 | USA | CA | 0 | 100 | 1200
1210 | Drew | 8012341234 | USA | TX | 0 | 0 | 1400
It can be done in a single line using Reduce and merge. With all = TRUE no ID is dropped, and merge joins on all the shared columns (ID, Name, Phone#, Country, State), so each amount column is simply appended:
Reduce(function(x, y) merge(x, y, all=TRUE), list(DF1, DF2, DF3))
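If you want 0 instead of NA in the amount columns (as in the desired output), you can fill them in afterwards; a small sketch, assuming the merge result is stored in combined:
combined <- Reduce(function(x, y) merge(x, y, all = TRUE), list(DF1, DF2, DF3))
combined[is.na(combined)] <- 0  # replace missing amounts with 0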
You can use left_join from the dplyr package, first joining the first two data frames, then joining that result with the third:
library(dplyr)
df_12 <- left_join(df1,df2, by = "ID")
df_123 <- left_join(df_12, df3, by = "ID")
Result:
df_123
   ID Amount_month1 Amount_month2 Amount_month3
1   1           100            NA            NA
2   2           200            50            NA
3   3           300           177           666
4   4           400            NA            77
Mock data:
df1 <- data.frame(
  ID = as.character(1:4),
  Amount_month1 = c(100, 200, 300, 400)
)
df2 <- data.frame(
  ID = as.character(2:3),
  Amount_month2 = c(50, 177)
)
df3 <- data.frame(
  ID = as.character(3:4),
  Amount_month3 = c(666, 77)
)
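One caveat worth flagging: left_join keeps only the IDs present in df1, so rows like Jerry and Drew from the question would be dropped. To retain IDs that appear in any of the data frames, as the desired output requires, full_join is presumably the better fit:
library(dplyr)
df_12  <- full_join(df1, df2, by = "ID")
df_123 <- full_join(df_12, df3, by = "ID")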

Convert decimal date to year and week number [duplicate]

This question already has answers here:
Converting date in Year.decimal form in R
(2 answers)
Closed 3 years ago.
I am running an ARIMA model with the forecast library; the output of this model looks something like this:
+----------+----------------+------------+----------+-----------+----------+
| | Point Forecast | Lo 80 | Hi 80 | Lo 95 | Hi 95 |
+----------+----------------+------------+----------+-----------+----------+
| 2016.261 | 335.0697 | 267.368566 | 402.7707 | 231.52977 | 438.6095 |
| 2016.281 | 346.7667 | 234.935713 | 458.5978 | 175.73594 | 517.7975 |
| 2016.300 | 296.3013 | 174.495528 | 418.1070 | 110.01547 | 482.5870 |
| 2016.319 | 379.0095 | 255.265230 | 502.7537 | 189.75899 | 568.2600 |
+----------+----------------+------------+----------+-----------+----------+
What I would like to achieve is to convert the decimal date (for example 2016.261) by adding two columns, one representing the year and the other the week number, achieving something like this:
+----------+---------+------+----------------+------------+----------+-----------+----------+
| | year | week | Point Forecast | Lo 80 | Hi 80 | Lo 95 | Hi 95 |
+----------+---------+------+----------------+------------+----------+-----------+----------+
| 2016.261 | 20.. | n1 | 335.0697 | 267.368566 | 402.7707 | 231.52977 | 438.6095 |
| 2016.281 | 20.. | n1 | 346.7667 | 234.935713 | 458.5978 | 175.73594 | 517.7975 |
| 2016.300 | 20.. | n3 | 296.3013 | 174.495528 | 418.1070 | 110.01547 | 482.5870 |
| 2016.319 | 20.. | n4 | 379.0095 | 255.265230 | 502.7537 | 189.75899 | 568.2600 |
+----------+---------+------+----------------+------------+----------+-----------+----------+
Well, with a dataframe like this, for example:
df1 <- data.frame(x = c(2016.01, 2016.32, 2016.261, 2016.281, 2016.300, 2016.319))
df1$date <- as.Date(as.character(df1$x), format="%Y.%j")
df1$year <- format(df1$date, "%Y")
df1$week <- format(df1$date, "%W")
df1
# x date year week
# 1 2016.010 2016-01-01 2016 00
# 2 2016.320 2016-02-01 2016 05
# 3 2016.261 2016-09-17 2016 37
# 4 2016.281 2016-10-07 2016 40
# 5 2016.300 2016-01-03 2016 00
# 6 2016.319 2016-11-14 2016 46
NB: I added the first two dates just to check that the dates were correct. And instead of df1 you can use your own dataframe. All information is actually from here.
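One hedged aside: the format = "%Y.%j" trick reads the fraction as a day-of-year count, whereas the time stamps printed by forecast/ts objects are fractions of a year (2016.261 is about a quarter of the way through 2016). If your values are of the latter kind, lubridate's date_decimal may be the safer starting point:
library(lubridate)
d <- date_decimal(2016.261)  # interprets .261 as a fraction of the year
format(d, "%Y")              # year
format(d, "%W")              # week number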

Data imputation for empty subsetted dataframes in R

I'm trying to build a function in R in which I can subset my raw dataframe according to some specifications, and thereafter convert this subsetted dataframe into a proportion table.
Unfortunately, some of these subsets yield an empty dataframe, as for some particular specifications I do not have data; hence no proportion table can be calculated. So what I would like to do is take the closest time step for which I have a non-empty subsetted dataframe and use it as input for the empty one.
Here are some insights into my dataframe and function:
My raw dataframe looks +/- as follows:
| year | quarter | area | time_comb | no_individuals | lenCls | age |
|------|---------|------|-----------|----------------|--------|-----|
| 2005 | 1 | 24 | 2005.1.24 | 8 | 380 | 3 |
| 2005 | 2 | 24 | 2005.2.24 | 4 | 490 | 2 |
| 2005 | 1 | 24 | 2005.1.24 | 3 | 460 | 6 |
| 2005 | 1 | 21 | 2005.1.21 | 25 | 400 | 2 |
| 2005 | 2 | 24 | 2005.2.24 | 1 | 680 | 6 |
| 2005 | 2 | 21 | 2005.2.21 | 2 | 620 | 5 |
| 2005 | 3 | 21 | 2005.3.21 | NA | NA | NA |
| 2005 | 1 | 21 | 2005.1.21 | 1 | 510 | 5 |
| 2005 | 1 | 24 | 2005.1.24 | 1 | 670 | 4 |
| 2006 | 1 | 22 | 2006.1.22 | 2 | 750 | 4 |
| 2006 | 4 | 24 | 2006.4.24 | 1 | 660 | 8 |
| 2006 | 2 | 24 | 2006.2.24 | 8 | 540 | 3 |
| 2006 | 2 | 24 | 2006.2.24 | 4 | 560 | 3 |
| 2006 | 1 | 22 | 2006.1.22 | 2 | 250 | 2 |
| 2006 | 3 | 22 | 2006.3.22 | 1 | 520 | 2 |
| 2006 | 2 | 24 | 2006.2.24 | 1 | 500 | 2 |
| 2006 | 2 | 22 | 2006.2.22 | NA | NA | NA |
| 2006 | 2 | 21 | 2006.2.21 | 3 | 480 | 2 |
| 2006 | 1 | 24 | 2006.1.24 | 1 | 640 | 5 |
| 2007 | 4 | 21 | 2007.4.21 | 2 | 620 | 3 |
| 2007 | 2 | 21 | 2007.2.21 | 1 | 430 | 3 |
| 2007 | 4 | 22 | 2007.4.22 | 14 | 410 | 2 |
| 2007 | 1 | 24 | 2007.1.24 | NA | NA | NA |
| 2007 | 2 | 24 | 2007.2.24 | NA | NA | NA |
| 2007 | 3 | 24 | 2007.3.24 | NA | NA | NA |
| 2007 | 4 | 24 | 2007.4.24 | NA | NA | NA |
| 2007 | 3 | 21 | 2007.3.21 | 1 | 560 | 4 |
| 2007 | 1 | 21 | 2007.1.21 | 7 | 300 | 3 |
| 2007 | 3 | 23 | 2007.3.23 | 1 | 640 | 5 |
Here year, quarter and area refer to a particular time (year & quarter) and area for which X number of individuals were measured (no_individuals). For example, from the first row we get that in the first quarter of the year 2005, in area 24, I had 8 individuals belonging to a length class (lenCls) of 380 mm and age 3. It is worth mentioning that for a particular year, quarter and area combination I can have different length classes and ages (thus, multiple rows)!
So what I want to do is basically to subset the raw dataframe for a particular year, quarter and area combination, and from that combination calculate a proportion table based on the number of individuals in each length class.
So far my basic function looks as follows:
LAK <- function(df, Year="2005", Quarter="1", Area="22", alkplot=TRUE){
  require(FSA)
  # subset alk by year, quarter and area
  sALK <- subset(df, year==Year & quarter==Quarter & area==Area)
  # expand to one row per measured individual
  dfexp <- sALK[rep(seq(nrow(sALK)), sALK$no_individuals), 1:ncol(sALK)]
  raw <- t(table(dfexp$lenCls, dfexp$age))
  key <- round(prop.table(raw, margin=1), 3)
  if(alkplot==TRUE){
    alkPlot(key, "area", xlab="Age")
  }
  return(key)
}
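For reference, a call against the example data would presumably look like this (assuming the raw dataframe is named df):
alk <- LAK(df, Year = "2005", Quarter = "1", Area = "24", alkplot = FALSE)
alk  # proportion of each age within each length class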
From the dataset example above, one can notice that for year=2005 & quarter=3 & area=21, I do not have any measured individuals. Yet, for the same area AND year I have data for either quarter 1 or 2. The most reasonable assumption would be to take the subsetted dataframe from the closest time step (here quarter 2, with the same area and year), and fill in the NAs in the columns "no_individuals", "lenCls" and "age" accordingly.
Note also that for some cases I do not have data for a particular year at all! In the example above, one can see this by looking at area 24 in year 2007. In this case I cannot borrow the information from the nearest quarter, and would need to borrow from the previous year instead. This would mean that for year=2007 & area=24 & quarter=1 I would borrow the information from year=2006 & area=24 & quarter=1, and so on and so forth.
I have tried to include this in my function by specifying some extra rules, but due to my poor programming skills I didn't make any progress.
So, any help here will be very much appreciated.
Here is my LAK function which I'm trying to update:
LAK <- function(df, Year="2005", Quarter="1", Area="22", alkplot=TRUE){
  require(FSA)
  # subset alk by year, quarter and area
  sALK <- subset(df, year==Year & quarter==Quarter & area==Area)
  # in case of an empty subset, or a placeholder row of NAs
  if(nrow(sALK) == 0 || any(rowSums(is.na(sALK)) > 0)){
    warning("Empty subset combination; data will be subsetted based on the nearest timestep combination")
    # FIXME: include imputation rules here
  }
  dfexp <- sALK[rep(seq(nrow(sALK)), sALK$no_individuals), 1:ncol(sALK)]
  raw <- t(table(dfexp$lenCls, dfexp$age))
  key <- round(prop.table(raw, margin=1), 3)
  if(alkplot==TRUE){
    alkPlot(key, "area", xlab="Age")
  }
  return(key)
}
So, I finally came up with a partial solution to my problem and will include my function here in case it is of interest to someone:
LAK <- function(df, Year="2005", Quarter="1", Area="22", alkplot=TRUE){
  require(FSA)
  # subset alk by year, quarter and area
  sALK <- subset(df, year==Year & quarter==Quarter & area==Area)
  print(sALK)
  # an empty combination appears as a single placeholder row of NAs
  if(nrow(sALK)==1){
    warning("Empty subset combination; data has been subsetted to the nearest input combination")
    syear <- unique(as.numeric(as.character(sALK$year)))
    sarea <- unique(as.numeric(as.character(sALK$area)))
    # all rows for the same year and area, across quarters
    sALK2 <- subset(df, year==syear & area==sarea)
    # time_comb is the year.quarter.area index from the example data
    vals <- as.data.frame(table(sALK2$time_comb))
    colnames(vals)[1] <- "time_comb"
    # keep the quarter that actually contains measurements (more than one row)
    idx <- which(vals$Freq > 1)
    quarterId <- as.character(vals[idx, "time_comb"])
    imput <- subset(df, year==syear & area==sarea & time_comb==quarterId)
    dfexp2 <- imput[rep(seq(nrow(imput)), imput$no_individuals), 1:ncol(imput)]
    raw2 <- t(table(dfexp2$lenCls, dfexp2$age))
    key2 <- round(prop.table(raw2, margin=1), 3)
    print(key2)
    if(alkplot==TRUE){
      alkPlot(key2, "area", xlab="Age")
    }
  } else {
    dfexp <- sALK[rep(seq(nrow(sALK)), sALK$no_individuals), 1:ncol(sALK)]
    raw <- t(table(dfexp$lenCls, dfexp$age))
    key <- round(prop.table(raw, margin=1), 3)
    print(key)
    if(alkplot==TRUE){
      alkPlot(key, "area", xlab="Age")
    }
  }
}
This solves my problem when I have data for at least one quarter of a particular year & area combination. Yet, I'm still struggling to figure out what to do when I do not have data for a particular year & area combination at all. In that case I need to borrow data from the closest year that contains data for the quarters of the same area.
For the example shown above, this would mean that for year=2007 & area=24 & quarter=1 I would borrow the information from year=2006 & area=24 & quarter=1, and so on and so forth.
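A minimal sketch of one way to encode that fallback, with a hypothetical helper (nearest_year is not part of any package; column names are taken from the example data):
# find the closest year that has measurements for a given quarter/area
nearest_year <- function(df, Year, Quarter, Area){
  cand <- subset(df, quarter == Quarter & area == Area & !is.na(no_individuals))
  yrs <- unique(as.numeric(as.character(cand$year)))
  if(length(yrs) == 0) return(NA)
  yrs[which.min(abs(yrs - as.numeric(Year)))]
}
# e.g. nearest_year(df, Year = "2007", Quarter = "1", Area = "24") should give 2006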
I don't know if you have ever encountered MICE, but it is a pretty cool and comprehensive tool for variable imputation. It also allows you to see how the imputed data are distributed, so that you can choose the method most suited to your problem. Check this brief explanation and the original package description.
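For reference, a minimal sketch of how mice is typically invoked (the method choice and column selection are assumptions, not from the original answer):
library(mice)
# impute the measurement columns; predictive mean matching is a common default
imp <- mice(df[, c("no_individuals", "lenCls", "age")], m = 5, method = "pmm", seed = 1)
completed <- complete(imp, 1)  # extract the first completed dataset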

COPY command in Cassandra execution order

I am copying a CSV file into Cassandra. I have the CSV file below, and the table is created as follows:
CREATE TABLE UCBAdmissions(
  id int PRIMARY KEY,
  admit text,
  dept text,
  freq int,
  gender text
);
When I use
copy UCBAdmissions from 'UCBAdmissions.csv' WITH DELIMITER = ',' AND HEADER = TRUE;
The output is
24 rows imported in 0.318 seconds.
cqlsh> select *from UCBAdmissions;
id | admit | dept | freq | gender
----+-------+------+------+--------
(0 rows)
copy UCBAdmissions(id, admit, gender, dept, freq) from 'UCBAdmissions.csv' WITH DELIMITER = ',' AND HEADER = TRUE;
The output is
24 rows imported in 0.364 seconds.
cqlsh> select *from UCBAdmissions;
id | admit | dept | freq | gender
----+----------+------+------+--------
23 | Admitted | F | 24 | Female
5 | Admitted | B | 353 | Male
10 | Rejected | C | 205 | Male
16 | Rejected | D | 244 | Female
13 | Admitted | D | 138 | Male
11 | Admitted | C | 202 | Female
1 | Admitted | A | 512 | Male
19 | Admitted | E | 94 | Female
8 | Rejected | B | 8 | Female
2 | Rejected | A | 313 | Male
4 | Rejected | A | 19 | Female
18 | Rejected | E | 138 | Male
15 | Admitted | D | 131 | Female
22 | Rejected | F | 351 | Male
20 | Rejected | E | 299 | Female
7 | Admitted | B | 17 | Female
6 | Rejected | B | 207 | Male
9 | Admitted | C | 120 | Male
14 | Rejected | D | 279 | Male
21 | Admitted | F | 22 | Male
17 | Admitted | E | 53 | Male
24 | Rejected | F | 317 | Female
12 | Rejected | C | 391 | Female
3 | Admitted | A | 89 | Female
UCBAdmissions.csv
"","Admit","Gender","Dept","Freq"
"1","Admitted","Male","A",512
"2","Rejected","Male","A",313
"3","Admitted","Female","A",89
"4","Rejected","Female","A",19
"5","Admitted","Male","B",353
"6","Rejected","Male","B",207
"7","Admitted","Female","B",17
"8","Rejected","Female","B",8
"9","Admitted","Male","C",120
"10","Rejected","Male","C",205
"11","Admitted","Female","C",202
"12","Rejected","Female","C",391
"13","Admitted","Male","D",138
"14","Rejected","Male","D",279
"15","Admitted","Female","D",131
"16","Rejected","Female","D",244
"17","Admitted","Male","E",53
"18","Rejected","Male","E",138
"19","Admitted","Female","E",94
"20","Rejected","Female","E",299
"21","Admitted","Male","F",22
"22","Rejected","Male","F",351
"23","Admitted","Female","F",24
"24","Rejected","Female","F",317
I see the output order differs from the CSV file, as seen above.
Question: What is the difference between the first and second COPY commands? Should we follow the same column order as the CSV file when creating the table in Cassandra?
Cassandra is designed to be distributed - to accomplish this, it uses the partition key of your table (id) and hashes it using the cluster's partitioner (probably Murmur3Partitioner) to create an integer (actually a Long), and then uses that integer to assign it to a node in the ring.
What you're seeing are the results ordered by the resulting token, which is non-intuitive but not necessarily wrong. There is no straightforward way to do a SELECT * FROM table ORDER BY primaryKey ASC in Cassandra - the distributed nature makes that difficult to do effectively.
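To see this for yourself, you can ask Cassandra for the token each row hashes to (an illustrative query, not from the original answer); an unrestricted SELECT returns rows in exactly this token order:
SELECT token(id), id, admit FROM UCBAdmissions;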

Interpolate variables on subsets of dataframe

I have a large dataframe which has observations from surveys from multiple states for several years. Here's the data structure:
state | survey.year | time1 | obs1 | time2 | obs2
CA | 2000 | 1 | 23 | 1.2 | 43
CA | 2001 | 2 | 43 | 1.4 | 52
CA | 2002 | 5 | 53 | 3.2 | 61
...
CA | 1998 | 3 | 12 | 2.3 | 20
CA | 1999 | 4 | 14 | 2.8 | 25
CA | 2003 | 5 | 19 | 4.3 | 29
...
ND | 2000 | 2 | 223 | 3.2 | 239
ND | 2001 | 4 | 233 | 4.2 | 321
ND | 2003 | 7 | 256 | 7.9 | 387
For each state/survey.year combination, I would like to interpolate obs2 so that its time locations line up with (time1, obs1).
I.e., I would like to break up the dataframe into state/survey.year chunks, perform my linear interpolation, and then stitch the individual state/survey.year dataframes back together into a master dataframe.
I have been trying to figure out how to use the plyr and Hmisc packages for this, but I keep getting myself in a tangle.
Here's the code that I wrote to do the interpolation:
require(Hmisc)
df <- new.obs2 <- NULL
for (i in 1:(0.5*(ncol(indirect)-1))){
  df[,"new.obs2"] <- approxExtrap(df[,"time1"],
                                  df[,"obs1"],
                                  xout = df[,"obs2"],
                                  method = "linear",
                                  rule = 2)
}
But I am not sure how to unleash plyr on this problem. Your generous advice and suggestions would be much appreciated. Essentially, I am just trying to interpolate "obs2", within each state/survey.year combination, so that its time references line up with those of "obs1".
Of course if there's a slick way to do this without invoking plyr functions, then I'd be open to that...
Thank you!
This should be as simple as (note the $y: approxExtrap returns a list, and transform needs the vector of interpolated values):
ddply(df, .(state, survey.year), transform,
      new.obs2 = approxExtrap(time1, obs1, xout = obs2,
                              method = "linear",
                              rule = 2)$y)
But I can't promise you anything, since I haven't the foggiest idea what the point of your for loop is. (It's overwriting df[,"new.obs2"] each time through the loop? You initialize the entire data frame df to NULL? What's indirect?)
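For what it's worth, if the intent is to evaluate the (time2, obs2) series at the time1 locations, as the question's prose suggests, the arguments would presumably go the other way around; a sketch under that assumption, not a verified fix:
require(plyr)
require(Hmisc)
# per state/survey.year group, interpolate obs2 from its own time grid onto time1
df2 <- ddply(df, .(state, survey.year), transform,
             new.obs2 = approxExtrap(time2, obs2, xout = time1,
                                     method = "linear", rule = 2)$y)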
