rearrange data in a specific structure - r

I have data in this format:
state   year1  year2
First   2000   2004-2005
Second  2007   2010-2011
Third   2008   2010
Third   2010   2012
I want to turn it into this:
state   year
First   2000
First   2004-2005
Second  2007
Second  2010-2011
Third   2008
Third   2010
Third   2012
The code can be in R or Python. Thanks in advance

The data.table package has a function called melt() that converts data from wide to long format. Here State is kept as the ID variable, and Year1 and Year2 are the variables pulled into the value column. A final step keeps only unique observations to remove duplicates.
library(data.table)
data <- data.table(
  State = c("First","Second","Third","Third"),
  Year1 = c("2000","2007","2008","2010"),
  Year2 = c("2004-2005","2010-2011","2010","2012"))
data
    State Year1     Year2
1:  First  2000 2004-2005
2: Second  2007 2010-2011
3:  Third  2008      2010
4:  Third  2010      2012
data2 <- melt(
  data = data,
  id.vars = c("State"),
  measure.vars = c("Year1","Year2"),
  variable.name = "Year",
  value.name = "years")
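# Drop the Year column before de-duplicating: (Third, Year1, 2010) and
# (Third, Year2, 2010) differ only in Year, so unique() on the full table keeps both.
data2 <- unique(data2[order(State), .(State, years)])
data2
    State     years
1:  First      2000
2:  First 2004-2005
3: Second      2007
4: Second 2010-2011
5:  Third      2008
6:  Third      2010
7:  Third      2012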
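If data.table is not available, the same reshape can be sketched in base R. This is only a sketch: it stacks the two year columns under a single column and de-duplicates on state and year alone. The names long and year are my own choices, not something from the question.

```r
# Same sample data as above, as a plain data.frame
data <- data.frame(
  State = c("First", "Second", "Third", "Third"),
  Year1 = c("2000", "2007", "2008", "2010"),
  Year2 = c("2004-2005", "2010-2011", "2010", "2012"),
  stringsAsFactors = FALSE
)

# Stack Year1 and Year2 into one column, repeating State once per year column
long <- data.frame(
  State = rep(data$State, times = 2),
  year  = c(data$Year1, data$Year2),
  stringsAsFactors = FALSE
)

# Sort by state, then year, and drop repeated (State, year) pairs
long <- unique(long[order(long$State, long$year), ])
rownames(long) <- NULL
long
```

Because the year values are kept as text (some are ranges such as "2004-2005"), the sort within each state is a string sort.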

Related

R: Count number of new observations compared to a previous groups

I would like to know the number of new observations that occurred between groups.
If I have the following data:
Year  Observation
2009  A
2009  A
2009  B
2010  A
2010  B
2010  C
I would like the output to be:
Year  New_Observation_Count
2009  2
2010  1
I am new to R and don't really know how to move forward. I have tried the count function from the tidyverse but still can't figure it out.
You can use union in Reduce:
y <- split(x$Observation, x$Year)
data.frame(Year = names(y),
           nNew = diff(lengths(Reduce(union, y, NULL, accumulate = TRUE))))
#  Year nNew
#1 2009    2
#2 2010    1
Data:
x <- read.table(header=TRUE, text="Year Observation
2009 A
2009 A
2009 B
2010 A
2010 B
2010 C")
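To see why the one-liner works, it may help to look at the intermediate steps. The sketch below unpacks it on the same data. The variable names seen and cumulative are mine, and I start the accumulation from character(0) rather than NULL, which is an assumption on my part that it plays the same "nothing seen yet" role.

```r
x <- data.frame(
  Year = c(2009, 2009, 2009, 2010, 2010, 2010),
  Observation = c("A", "A", "B", "A", "B", "C"),
  stringsAsFactors = FALSE
)

# One vector of observations per year
y <- split(x$Observation, x$Year)

# Running union: everything seen up to and including each year,
# with the empty starting set kept as the first element
seen <- Reduce(union, y, character(0), accumulate = TRUE)

# Size of the running set before the first year and after each year
cumulative <- lengths(seen)

# Growth of the running set from one step to the next = new observations
nNew <- diff(cumulative)

data.frame(Year = names(y), nNew = nNew)
```

The count for a year is simply how much the set of already-seen observations grows when that year is added.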

Change name of column after uniqueN function

I am already happy with the results, but want to further tidy up my data by giving the right name to the respective column.
The task is to count the number of distinct authors for each publication year between 2000 and 2010. Here is my code and my result:
books_dt[Year_Of_Publication <= 2010 & Year_Of_Publication >= 2000, uniqueN(Book_Author), by = "Year_Of_Publication"][order(Year_Of_Publication)]
   Year_Of_Publication    V1
1:                2000 12057
2:                2001 11818
3:                2002 11942
4:                2003  9913
5:                2004  4536
6:                2005    38
7:                2006     3
8:                2008     1
9:                2010     2
The numbers in the result are right, but I want to change the column name V1 to something like "Num_Of_Dif_Auth". I tried the setnames function, but as I don't want to change the underlying dataset it didn't help.
You can name the column directly in j by wrapping the expression in .():
library(data.table)
books_dt[Year_Of_Publication <= 2010 & Year_Of_Publication >= 2000,
         .(Num_Of_Dif_Auth = uniqueN(Book_Author)),
         by = Year_Of_Publication][order(Year_Of_Publication)]
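Here is a small self-contained sketch of that idea. The real books_dt is not shown in the question, so the toy table below is made up purely for illustration. It also shows that calling setnames() on the aggregated result (res2 here, a hypothetical name) leaves the original table's column names untouched, since the aggregation returns a new data.table.

```r
library(data.table)

# Made-up stand-in for books_dt (hypothetical values, not the real data)
books_dt <- data.table(
  Year_Of_Publication = c(2000, 2000, 2000, 2001, 2001, 1999),
  Book_Author = c("A", "B", "A", "C", "C", "D")
)

# Option 1: name the column inside .() while aggregating
res <- books_dt[Year_Of_Publication %between% c(2000, 2010),
                .(Num_Of_Dif_Auth = uniqueN(Book_Author)),
                by = Year_Of_Publication][order(Year_Of_Publication)]

# Option 2: aggregate first, then rename V1 on the result only
res2 <- books_dt[Year_Of_Publication %between% c(2000, 2010),
                 uniqueN(Book_Author),
                 by = Year_Of_Publication][order(Year_Of_Publication)]
setnames(res2, "V1", "Num_Of_Dif_Auth")

res
names(books_dt)  # still Year_Of_Publication and Book_Author
```

%between% is inclusive on both ends by default, so it matches the >= 2000 and <= 2010 conditions used above.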

Compute factor data between two data frames in R

I have not found a solution for this. I think it should be very simple, but I can't work it out right now.
I have two data frames, monthly traffic volume averages, and yearly traffic volume averages. I need to divide yearly averages by monthly averages.
    ano mes dias  Au_TPDM  Bu_TPDM  CU_TPDM CAI_TPDM CAII_TPDM    TOTAL
1  2012 Ene   31 4288.323 620.5161 236.7419 4635.097  139.0645 6112.258
7  2012 Feb   29 3268.862 593.0000 246.3103 5191.069  147.9655 6267.286
13 2012 Mar   31 3667.903 624.7097 289.0323 5341.774  154.7419 6740.226
19 2012 Abr   30 4668.767 647.2333 281.2667 4930.433  158.3000 7236.300
25 2012 May   31 3198.581 598.9677 256.1290 5384.742  202.2581 6612.581
31 2012 Jun   30 3609.067 605.8667 280.3333 5309.500  178.7000 6795.000
  anosDB  TPDA_Au  TPDA_Bu  TPDA_CU TPDA_CAI TPDA_CAII TPDA_TOTAL
1   2012 4271.096 617.4809 255.1967 5119.454  163.5055   10426.73
2   2013 4685.079 638.5616 259.8877 5287.822  154.0110   11025.36
3   2014 4969.277 656.3918 266.8986 5407.800  177.0932   11477.46
4   2015 5184.953 541.8822 400.2137 4941.422  271.6877   11340.16
5   2016 5220.872 408.6967 541.0519 5584.492  182.4399   11937.55
6   2017 5298.852 408.7562 556.5644 6033.652  266.1644   12563.99
So the first row of the TPDA table should be divided by each of the first 12 rows of the TPDM table, creating a new data frame that contains the monthly factors.
Something like:
ano   mes  dias  FA_Au
2012  Ene  31    4271.096/4288.323
2012  Feb  29    4271.096/3268.862
(Don't need to show the computation, just the result)
I am sure that selecting the data by year would do that but haven't found the right way to do it.
Merge by year and find columns to divide by position
As already mentioned by zx8754, this can be done in base R by merging on year and dividing the corresponding columns:
merged <- merge(TPDM, TPDA, by.x = "ano", by.y = "anosDB")
FA <- cbind(merged[, 1:3], merged[, 10:15]/merged[, 4:9])
# rename columns
names(FA) <- sub("TPDA_", "FA_", names(FA))
FA
ano mes dias FA_Au FA_Bu FA_CU FA_CAI FA_CAII FA_TOTAL
1 2012 Ene 31 0.9959828 0.9951086 1.0779532 1.1044977 1.1757530 1.705872
2 2012 Feb 29 1.3066003 1.0412831 1.0360781 0.9862042 1.1050245 1.663675
3 2012 Mar 31 1.1644517 0.9884285 0.8829349 0.9583809 1.0566337 1.546941
4 2012 Abr 30 0.9148231 0.9540314 0.9073122 1.0383376 1.0328838 1.440892
5 2012 May 31 1.3353096 1.0309085 0.9963600 0.9507334 0.8084003 1.576802
6 2012 Jun 30 1.1834349 1.0191696 0.9103332 0.9642064 0.9149720 1.534471
Caveat:
This approach works as long as the positions, i.e., column numbers, of the corresponding columns are known. With the given datasets, the columns are ordered in the same way. Therefore, only an offset has to be considered to match corresponding columns.
Merge by year and find columns to divide by name
If, for some reason, the positions are not known in advance, we can find corresponding columns by matching the column names.
For this, both datasets are reshaped from wide to long format, where the column names (now stored in a column called variable) are treated as data. We can then join monthly and annual values on year and column name, divide the annual values by the corresponding monthly values, and finally reshape back to wide format:
library(data.table)
# reshape and prepare monthly data
longM <- melt(setDT(TPDM), id.vars = 1:3)
longM[, variable := stringr::str_replace(variable, "_TPDM", "")]
longM[, mes := forcats::fct_inorder(mes)]
# reshape and prepare annual data
longA <- melt(setDT(TPDA), id.vars = 1)
longA[, variable := stringr::str_replace(variable, "TPDA_", "")]
setnames(longA, "anosDB", "ano")
# join
long_FA <- longA[longM, on = .(ano, variable),
                 .(ano, mes, dias, variable, FA = value / i.value)]
# reshape back to wide format
dcast(long_FA, ano + mes + dias ~ paste0("FA_", variable), value.var = "FA")
ano mes dias FA_Au FA_Bu FA_CAI FA_CAII FA_CU FA_TOTAL
1: 2012 Ene 31 0.9959828 0.9951086 1.1044977 1.1757530 1.0779532 1.705872
2: 2012 Feb 29 1.3066003 1.0412831 0.9862042 1.1050245 1.0360781 1.663675
3: 2012 Mar 31 1.1644517 0.9884285 0.9583809 1.0566337 0.8829349 1.546941
4: 2012 Abr 30 0.9148231 0.9540314 1.0383376 1.0328838 0.9073122 1.440892
5: 2012 May 31 1.3353096 1.0309085 0.9507334 0.8084003 0.9963600 1.576802
6: 2012 Jun 30 1.1834349 1.0191696 0.9642064 0.9149720 0.9103332 1.534471
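For comparison, matching by name can also be sketched in base R without reshaping, by building the paired column names from a shared stem. This is only a sketch on a two-month, two-class subset of the question's data. It assumes every monthly column ends in _TPDM and has an annual counterpart starting with TPDA_, and the helper variable classes is my own name.

```r
# Subset of the question's data: two months, two vehicle classes
TPDM <- data.frame(
  ano = c(2012, 2012), mes = c("Ene", "Feb"), dias = c(31, 29),
  Au_TPDM = c(4288.323, 3268.862),
  Bu_TPDM = c(620.5161, 593.0000),
  stringsAsFactors = FALSE
)
# Annual columns deliberately listed in a different order than the monthly ones
TPDA <- data.frame(
  anosDB  = c(2012, 2013),
  TPDA_Bu = c(617.4809, 638.5616),
  TPDA_Au = c(4271.096, 4685.079)
)

merged <- merge(TPDM, TPDA, by.x = "ano", by.y = "anosDB")

# Stems shared by both tables, e.g. "Au" and "Bu"
classes <- sub("_TPDM$", "", grep("_TPDM$", names(TPDM), value = TRUE))

# Divide annual by monthly columns, looked up by constructed name
FA <- merged[paste0("TPDA_", classes)] / merged[paste0(classes, "_TPDM")]
names(FA) <- paste0("FA_", classes)
FA <- cbind(merged[c("ano", "mes", "dias")], FA)
FA
```

Note that in the full data the monthly TOTAL column has no _TPDM suffix, so it would not be picked up by this pattern and would need to be renamed or handled separately.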
Data
TPDM <- read.table(text = "
i ano mes dias Au_TPDM Bu_TPDM CU_TPDM CAI_TPDM CAII_TPDM TOTAL
1 2012 Ene 31 4288.323 620.5161 236.7419 4635.097 139.0645 6112.258
7 2012 Feb 29 3268.862 593.0000 246.3103 5191.069 147.9655 6267.286
13 2012 Mar 31 3667.903 624.7097 289.0323 5341.774 154.7419 6740.226
19 2012 Abr 30 4668.767 647.2333 281.2667 4930.433 158.3000 7236.300
25 2012 May 31 3198.581 598.9677 256.1290 5384.742 202.2581 6612.581
31 2012 Jun 30 3609.067 605.8667 280.3333 5309.500 178.7000 6795.000
", header = TRUE)[, -1L]
TPDA <- read.table(text = "
i anosDB TPDA_Au TPDA_Bu TPDA_CU TPDA_CAI TPDA_CAII TPDA_TOTAL
1 2012 4271.096 617.4809 255.1967 5119.454 163.5055 10426.73
2 2013 4685.079 638.5616 259.8877 5287.822 154.0110 11025.36
3 2014 4969.277 656.3918 266.8986 5407.800 177.0932 11477.46
4 2015 5184.953 541.8822 400.2137 4941.422 271.6877 11340.16
5 2016 5220.872 408.6967 541.0519 5584.492 182.4399 11937.55
6 2017 5298.852 408.7562 556.5644 6033.652 266.1644 12563.99
", header = TRUE)[, -1L]

Delete rows without a full year data

I have a large data set that contains monthly returns of a given stock. I'd like to delete the rows for years that do not have a full year of data. A subset of the data is shown below as an example:
Date Return Year
9/1/2009 0.71447 2009
10/1/2009 0.48417 2009
11/1/2009 0.90753 2009
12/1/2009 -0.7342 2009
1/1/2010 0.83293 2010
2/1/2010 0.18279 2010
3/1/2010 0.19416 2010
4/1/2010 0.38907 2010
5/1/2010 0.37834 2010
6/1/2010 0.6401 2010
7/1/2010 0.62079 2010
8/1/2010 0.42128 2010
9/1/2010 0.43117 2010
10/1/2010 0.42307 2010
11/1/2010 -0.1994 2010
12/1/2010 -0.2252 2010
Ideally, the code will remove the first four observations since they don't have a full year of observation.
The OP has requested to remove all rows from a large data set of monthly values which do not make up a full year. Although the solution suggested by Wen seems to be working for the OP, I would like to suggest a more robust approach.
Wen's solution counts the number of rows per year, assuming that there is exactly one row per month. It would be more robust to count the number of unique months per year in case there are duplicate entries in the production data set.
(In my experience, one cannot be too careful when dealing with production data, so check all assumptions.)
library(data.table)
# count number of unique months per year,
# keep only complete years, omit counts
# result is a data.table with one column Year
full_years <- DT[, uniqueN(month(Date)), by = Year][V1 == 12L, -"V1"]
full_years
Year
1: 2010
# right join with original table, only rows belonging to a full year will be returned
DT[full_years, on = "Year"]
Date Return Year
1: 2010-01-01 0.83293 2010
2: 2010-02-01 0.18279 2010
3: 2010-03-01 0.19416 2010
4: 2010-04-01 0.38907 2010
5: 2010-05-01 0.37834 2010
6: 2010-06-01 0.64010 2010
7: 2010-07-01 0.62079 2010
8: 2010-08-01 0.42128 2010
9: 2010-09-01 0.43117 2010
10: 2010-10-01 0.42307 2010
11: 2010-11-01 -0.19940 2010
12: 2010-12-01 -0.22520 2010
Note that this approach avoids adding a count column to each row of a potentially large data set.
The code can be written more concisely as:
DT[DT[, uniqueN(month(Date)), by = Year][V1 == 12L, -"V1"], on = "Year"]
It is also possible to check the data for any duplicate months, e.g.,
stopifnot(all(DT[, .N, by = .(Year, month(Date))]$N == 1L))
This code counts the number of occurrences for each combination of year and month and halts execution if any combination occurs more than once.
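For anyone who wants to run this end to end, here is a sketch that builds DT from the question's sample rows. It assumes the Date column arrives as text in month/day/year form and parses it before using month() and year(). The column name n_months is my own choice, used instead of the default V1.

```r
library(data.table)

# Sample rows from the question; Date starts out as text (assumed m/d/Y)
DT <- data.table(
  Date = c("9/1/2009", "10/1/2009", "11/1/2009", "12/1/2009",
           "1/1/2010", "2/1/2010", "3/1/2010", "4/1/2010",
           "5/1/2010", "6/1/2010", "7/1/2010", "8/1/2010",
           "9/1/2010", "10/1/2010", "11/1/2010", "12/1/2010"),
  Return = c(0.71447, 0.48417, 0.90753, -0.7342,
             0.83293, 0.18279, 0.19416, 0.38907,
             0.37834, 0.6401, 0.62079, 0.42128,
             0.43117, 0.42307, -0.1994, -0.2252)
)

# Parse the text into Date class, then derive the year
DT[, Date := as.Date(Date, format = "%m/%d/%Y")]
DT[, Year := year(Date)]

# Years that contain all 12 distinct months
full_years <- DT[, .(n_months = uniqueN(month(Date))), by = Year][
  n_months == 12L, .(Year)]

# Keep only rows that belong to a complete year
result <- DT[full_years, on = "Year"]
result
```

With this sample, 2009 has only four months, so only the 2010 rows remain.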

Long Format Function [duplicate]

I am having trouble with a project. I created a data frame (called dat) in long format (I copied the first 3 rows below), and I want to calculate, for example, the mean of the Pretax Income of all banks in the United States for the years 2000 to 2011. How would I do that? I have hardly any experience in R. I am sorry if the answer is too obvious, but I couldn't find anything and I have already spent a lot of time on the project. Thank you in advance!
            KeyItem                  Bank       Country Year        Value
1     Pretax Income WELLS_FARGO_&_COMPANY UNITED STATES 2011 2.365600e+10
2      Total Assets WELLS_FARGO_&_COMPANY UNITED STATES 2011 1.313867e+12
3 Total Liabilities WELLS_FARGO_&_COMPANY UNITED STATES 2011 1.172180e+12
The following should get you started. You basically need to do two things: subset, and aggregate. I'll demonstrate a base R solution and a data.table solution.
First, some sample data.
set.seed(1) # So you can reproduce my results
dat <- data.frame(KeyItem = rep(c("Pretax", "TotalAssets", "TotalLiabilities"),
                                times = 30),
                  Bank = rep(c("WellsFargo", "BankOfAmerica", "ICICI"),
                             each = 30),
                  Country = rep(c("UnitedStates", "India"), times = c(60, 30)),
                  Year = rep(c(2000:2009), each = 3, times = 3),
                  Value = runif(90, min=300, max=600))
Let's aggregate the mean of the "Pretax" values by "Country" and "Year", but only for the years 2001 to 2005.
aggregate(Value ~ Country + Year,
          dat[dat$KeyItem == "Pretax" & dat$Year >= 2001 & dat$Year <= 2005, ],
          mean)
# Country Year Value
# 1 India 2001 399.7184
# 2 UnitedStates 2001 464.1638
# 3 India 2002 443.5636
# 4 UnitedStates 2002 560.8373
# 5 India 2003 562.5964
# 6 UnitedStates 2003 370.9591
# 7 India 2004 404.0050
# 8 UnitedStates 2004 520.4933
# 9 India 2005 567.6595
# 10 UnitedStates 2005 493.0583
Here's the same thing in data.table:
library(data.table)
DT <- data.table(dat, key = "Country,Bank,Year")
subset(DT, KeyItem == "Pretax")[Year %between% c(2001, 2005),
                                mean(Value), by = list(Country, Year)]
# Country Year V1
# 1: India 2001 399.7184
# 2: India 2002 443.5636
# 3: India 2003 562.5964
# 4: India 2004 404.0050
# 5: India 2005 567.6595
# 6: UnitedStates 2001 464.1638
# 7: UnitedStates 2002 560.8373
# 8: UnitedStates 2003 370.9591
# 9: UnitedStates 2004 520.4933
# 10: UnitedStates 2005 493.0583
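Applied to the labels used in the question itself, the same subset-then-aggregate pattern might look like the sketch below. It uses only the three rows the question shows, so the numbers are not meaningful; the point is just the filter on "Pretax Income", "UNITED STATES" and the 2000 to 2011 range. The names us_pretax, overall_mean and by_year are mine.

```r
# The three rows shown in the question
dat <- data.frame(
  KeyItem = c("Pretax Income", "Total Assets", "Total Liabilities"),
  Bank    = "WELLS_FARGO_&_COMPANY",
  Country = "UNITED STATES",
  Year    = 2011,
  Value   = c(2.365600e+10, 1.313867e+12, 1.172180e+12),
  stringsAsFactors = FALSE
)

# Step 1: subset to the rows of interest
us_pretax <- subset(dat, KeyItem == "Pretax Income" &
                         Country == "UNITED STATES" &
                         Year >= 2000 & Year <= 2011)

# Step 2a: a single mean over all banks and years
overall_mean <- mean(us_pretax$Value)

# Step 2b: or one mean per year across banks
by_year <- aggregate(Value ~ Year, data = us_pretax, FUN = mean)

overall_mean
by_year
```

Whether a single overall mean or one mean per year is wanted is not clear from the question, so both are shown.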

Resources