Plug in values of one data frame into another - r

I have the following data frame df (the month columns contain NA values):
Vehicle Jan Feb Mar Apr May Jun Jul Aug Sep Oct Nov Dec
123X
435y
...
where the number of rows is defined by the number of unique vehicle numbers. The number of columns is 13.
The following data frame is generated in a for loop for each vehicle, to find the number of breakdowns of that vehicle in each month
[for a single vehicle]
occurrences:
Month Freq
Jan 1
Mar 3
Jul 5
May 3
Each time the occurrences data frame is generated, I want to plug the Freq values into the df data frame by month name.
I have tried using a for loop, which is becoming very complex.
Is there an easier way to plug the values from the occurrences data frame into the df data frame?
I am expecting the following result for df:
Vehicle Jan Feb Mar Apr May Jun Jul Aug Sep Oct Nov Dec
123X 1 na 3 na 3 na 5 na na na na na
435y
...
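One possible approach, sketched under assumptions: because the month names in occurrences match the column names of df, the Freq values can be assigned in one step by selecting the vehicle's row and the month columns by name. The toy df and the hard-coded vehicle_id below are illustrative only; inside the loop the id would come from the loop variable.

```r
# Toy version of df: one row per vehicle, 12 month columns filled with NA
df <- data.frame(
  Vehicle = c("123X", "435y"),
  matrix(NA_real_, nrow = 2, ncol = 12, dimnames = list(NULL, month.abb)),
  stringsAsFactors = FALSE
)

# occurrences as generated for a single vehicle (values from the question)
occurrences <- data.frame(
  Month = c("Jan", "Mar", "Jul", "May"),
  Freq  = c(1, 3, 5, 3),
  stringsAsFactors = FALSE
)

vehicle_id <- "123X"  # assumed: inside the loop this is the current vehicle

# Select the vehicle's row and the month columns by name, then assign
df[df$Vehicle == vehicle_id, as.character(occurrences$Month)] <- occurrences$Freq
```

Months that do not appear in occurrences are simply left as NA, which matches the expected result in the question.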

Related

Using R how do I delete duplicates based on multiple columns but select the "most" completed version of the duplicates

I have a script that deletes rows based on duplicate values in three columns. There are many more than three columns, but I want to delete based on those specific ones
DF2021 <-DF2021 [!duplicated (DF2021[,c("column1","column2","column3")]),]
The script above works and it leaves me with one row for each time there is a duplicate based on those three columns.
The next step is where I wonder how to make sure I am left with the row that meets a criterion. For example, I want the row with the fewest NAs.
column1|column2|column3|column4|column5|column6|column 7
Jan Tue 2020 Blue Warm Hospital NA
Jan Tue 2020 Blue Warm NA NA
Jan Tue 2020 Blue NA NA NA
Feb Thu 2020 Red NA NA NA
Feb Thu 2020 Red Warm NA NA
Feb Thu 2020 Red Warm Garden Run
Mar Thu 2020 Red Cold Desk Bus
In the end I would expect the duplicate value to leave me with three rows.
column1|column2|column3|column4|column5|column6|column 7
Jan Tue 2020 Blue Warm Hospital NA
Feb Thu 2020 Red Warm Garden Run
Mar Thu 2020 Red Cold Desk Bus
Note that if I were to do
DF2021 <- DF2021[complete.cases(DF2021),]
It would only give me the Feb and Mar rows but not the Jan one. I want the script to remove duplicates based on those three columns and, out of each set of duplicates, keep the most complete row, which does not have to be a fully complete row.
Try this. You can create a function that flags the rows that are complete or have at most one NA. With that, you can index the data frame and select those rows. Here is the code:
#Index for selection: returns 1 if the row has at most one NA, else 0
myfun <- function(x)
{
y <- length(which(is.na(x)))
y <- ifelse(y<=1,1,0)
return(y)
}
#Apply to each row (df holds the data from the question)
index <- which(apply(df,1,myfun)==1)
#Output
out <- df[index,]
Output:
column1 column2 column3 column4 column5 column6 column7
1 Jan Tue 2020 Blue Warm Hospital <NA>
6 Feb Thu 2020 Red Warm Garden Run
7 Mar Thu 2020 Red Cold Desk Bus
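The function above keeps rows under a fixed threshold (at most one NA). If the goal is instead the row with the fewest NAs within each duplicate group, whatever that count is, a possible alternative (a sketch, not the answerer's method) is to sort by NA count first and then apply the duplicated() filter from the question, so the first row kept per group is the most complete one. The data frame below re-creates the question's example; the names na_count, sorted, and out are illustrative.

```r
# Example data from the question
DF2021 <- data.frame(
  column1 = c("Jan", "Jan", "Jan", "Feb", "Feb", "Feb", "Mar"),
  column2 = c("Tue", "Tue", "Tue", "Thu", "Thu", "Thu", "Thu"),
  column3 = 2020,
  column4 = c("Blue", "Blue", "Blue", "Red", "Red", "Red", "Red"),
  column5 = c("Warm", "Warm", NA, NA, "Warm", "Warm", "Cold"),
  column6 = c("Hospital", NA, NA, NA, NA, "Garden", "Desk"),
  column7 = c(NA, NA, NA, NA, NA, "Run", "Bus"),
  stringsAsFactors = FALSE
)

# Count NAs per row and put the most complete rows first
na_count <- rowSums(is.na(DF2021))
sorted <- DF2021[order(na_count), ]

# Keep the first (most complete) row of each column1/column2/column3 group
out <- sorted[!duplicated(sorted[, c("column1", "column2", "column3")]), ]
```

Note that the result is ordered by NA count rather than by the original row order; re-sort afterwards if the original order matters.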

Compute factor data between two data frames in R

I have not found a solution for this, and I think it should be very simple but now I can't think right.
I have two data frames, monthly traffic volume averages, and yearly traffic volume averages. I need to divide yearly averages by monthly averages.
ano mes dias Au_TPDM Bu_TPDM CU_TPDM CAI_TPDM CAII_TPDM TOTAL
1 2012 Ene 31 4288.323 620.5161 236.7419 4635.097 139.0645 6112.258
7 2012 Feb 29 3268.862 593.0000 246.3103 5191.069 147.9655 6267.286
13 2012 Mar 31 3667.903 624.7097 289.0323 5341.774 154.7419 6740.226
19 2012 Abr 30 4668.767 647.2333 281.2667 4930.433 158.3000 7236.300
25 2012 May 31 3198.581 598.9677 256.1290 5384.742 202.2581 6612.581
31 2012 Jun 30 3609.067 605.8667 280.3333 5309.500 178.7000 6795.000
anosDB TPDA_Au TPDA_Bu TPDA_CU TPDA_CAI TPDA_CAII TPDA_TOTAL
1 2012 4271.096 617.4809 255.1967 5119.454 163.5055 10426.73
2 2013 4685.079 638.5616 259.8877 5287.822 154.0110 11025.36
3 2014 4969.277 656.3918 266.8986 5407.800 177.0932 11477.46
4 2015 5184.953 541.8822 400.2137 4941.422 271.6877 11340.16
5 2016 5220.872 408.6967 541.0519 5584.492 182.4399 11937.55
6 2017 5298.852 408.7562 556.5644 6033.652 266.1644 12563.99
So the first 12 rows of the TPDM table should divide the first row of the TPDA table and create a new data frame which should contain monthly factors.
Something like:
ano mes dias FA_Au
2012 Ene 31 4271.096/4288.323
2012 Feb 29 4271.096/3268.862
(Don't need to show the computation, just the result)
I am sure that selecting the data by year would do that but haven't found the right way to do it.
Merge by year and find columns to divide by position
As already mentioned by zx8754 this can be done by merging on year and dividing the corresponding columns in base R:
merged <- merge(TPDM, TPDA, by.x = "ano", by.y = "anosDB")
FA <- cbind(merged[, 1:3], merged[, 10:15]/merged[, 4:9])
# rename columns
names(FA) <- sub("TPDA_", "FA_", names(FA))
FA
ano mes dias FA_Au FA_Bu FA_CU FA_CAI FA_CAII FA_TOTAL
1 2012 Ene 31 0.9959828 0.9951086 1.0779532 1.1044977 1.1757530 1.705872
2 2012 Feb 29 1.3066003 1.0412831 1.0360781 0.9862042 1.1050245 1.663675
3 2012 Mar 31 1.1644517 0.9884285 0.8829349 0.9583809 1.0566337 1.546941
4 2012 Abr 30 0.9148231 0.9540314 0.9073122 1.0383376 1.0328838 1.440892
5 2012 May 31 1.3353096 1.0309085 0.9963600 0.9507334 0.8084003 1.576802
6 2012 Jun 30 1.1834349 1.0191696 0.9103332 0.9642064 0.9149720 1.534471
Caveat:
This approach works as long as the positions, i.e., column numbers, of the corresponding columns are known. With the given datasets, the columns are ordered in the same way. Therefore, only an offset has to be considered to match corresponding columns.
Merge by year and find columns to divide by name
If, for some reason, the positions are not known in advance we can find corresponding columns by matching the column names.
For this, both datasets are reshaped from wide to long format. In long format, the column names (now called variable) are treated as data. Now, we can join monthly and annual values on year and column name, divide annual values by the corresponding monthly values, and reshape back to wide format, finally:
library(data.table)
# reshape and prepare monthly data
longM <- melt(setDT(TPDM), id.vars = 1:3)
longM[, variable := stringr::str_replace(variable, "_TPDM", "")]
longM[, mes := forcats::fct_inorder(mes)]
# reshape and prepare annual data
longA <- melt(setDT(TPDA), id.vars = 1)
longA[, variable := stringr::str_replace(variable, "TPDA_", "")]
setnames(longA, "anosDB", "ano")
# join
long_FA <- longA[longM, on = .(ano, variable),
.(ano, mes, dias, variable, FA = value/i.value)]
# reshape back to wide format
dcast(long_FA, ano + mes +dias ~ paste0("FA_", variable), value.var = "FA")
ano mes dias FA_Au FA_Bu FA_CAI FA_CAII FA_CU FA_TOTAL
1: 2012 Ene 31 0.9959828 0.9951086 1.1044977 1.1757530 1.0779532 1.705872
2: 2012 Feb 29 1.3066003 1.0412831 0.9862042 1.1050245 1.0360781 1.663675
3: 2012 Mar 31 1.1644517 0.9884285 0.9583809 1.0566337 0.8829349 1.546941
4: 2012 Abr 30 0.9148231 0.9540314 1.0383376 1.0328838 0.9073122 1.440892
5: 2012 May 31 1.3353096 1.0309085 0.9507334 0.8084003 0.9963600 1.576802
6: 2012 Jun 30 1.1834349 1.0191696 0.9642064 0.9149720 0.9103332 1.534471
Data
TPDM <- read.table(text = "
i ano mes dias Au_TPDM Bu_TPDM CU_TPDM CAI_TPDM CAII_TPDM TOTAL
1 2012 Ene 31 4288.323 620.5161 236.7419 4635.097 139.0645 6112.258
7 2012 Feb 29 3268.862 593.0000 246.3103 5191.069 147.9655 6267.286
13 2012 Mar 31 3667.903 624.7097 289.0323 5341.774 154.7419 6740.226
19 2012 Abr 30 4668.767 647.2333 281.2667 4930.433 158.3000 7236.300
25 2012 May 31 3198.581 598.9677 256.1290 5384.742 202.2581 6612.581
31 2012 Jun 30 3609.067 605.8667 280.3333 5309.500 178.7000 6795.000
", header = TRUE)[, -1L]
TPDA <- read.table(text = "
i anosDB TPDA_Au TPDA_Bu TPDA_CU TPDA_CAI TPDA_CAII TPDA_TOTAL
1 2012 4271.096 617.4809 255.1967 5119.454 163.5055 10426.73
2 2013 4685.079 638.5616 259.8877 5287.822 154.0110 11025.36
3 2014 4969.277 656.3918 266.8986 5407.800 177.0932 11477.46
4 2015 5184.953 541.8822 400.2137 4941.422 271.6877 11340.16
5 2016 5220.872 408.6967 541.0519 5584.492 182.4399 11937.55
6 2017 5298.852 408.7562 556.5644 6033.652 266.1644 12563.99
", header = TRUE)[, -1L]

R Studio: look up a value in table(both direction V&H), then use as a variable in loop

I am dealing with a dataset ("IndexTable") that has 3 million+ observations. The first 6 observations are shown below:
Identity gender type amount Year Month
1 65 F W 31.88 1987 Jan
2 23 M P 29.21 1985 Mar
3 45 F W 44.70 1987 Jan
4 47 F W 72.64 1987 Jan
5 56 M P 28.92 1986 Jul
6 09 F W 34.32 1990 Jan
and the index table ("index") from which the value will be searched (part of the table):
year average Jan Feb Mar Apr May Jun Jul Aug Sep Oct Nov Dec
1 1950 32.84210 33.19118 33.10321 33.01572 32.89977 32.81334 32.98665 32.98665 33.10321 32.89977 32.55677 32.41595 32.24857
2 1951 30.09866 31.94615 31.64936 31.43694 30.94371 30.19568 30.09866 29.64623 29.50617 29.29854 29.09382 28.98131 28.78098
3 1952 27.56470 28.28139 28.25313 28.11271 27.67259 27.67259 27.21981 27.24604 27.40444 27.45766 27.21981 27.24604 27.06353
4 1953 26.73099 27.08945 27.01183 26.83243 26.58025 26.68055 26.53038 26.53038 26.70575 26.75628 26.75628 26.68055 26.78162
5 1954 26.25941 26.73099 26.78162 26.53038 26.43120 26.50552 26.35730 25.92244 26.08984 26.13807 26.01783 25.89871 25.75718
6 1955 25.11668 25.66369 25.66369 25.66369 25.52472 25.57087 25.04994 24.96151 25.13901 24.98356 24.72149 24.33854 24.33854
For each observation in "IndexTable", I would like to find the value in "index" that matches its Year and Month, then multiply its amount by that value to get the adjusted amount.
Thanks in advance
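Using the dplyr and tidyr packages:
library(dplyr)
library(tidyr)
# reshape the month columns into long format: one row per year/month
index_long <- index %>%
gather(Month, multiplier, Jan:Dec) %>%
select(-average)
# attach the matching multiplier to each observation and adjust the amount
left_join(IndexTable, index_long, by = c("Year" = "year", "Month" = "Month")) %>%
mutate(adjusted_amount = amount*multiplier)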
First I gather the month columns into a single Month column, with the values going into a column called multiplier.
I drop the average column, because it doesn't need to be joined to the other table. Then, with a left join, only the values with a matching year and month combination are joined to IndexTable.
Finally, I use the multiplier to create the new column adjusted_amount.
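For comparison, the same lookup can be sketched in base R without reshaping, by matching each observation to a row (year) and a column (month) of the index table and reading the values with matrix indexing. This is only an illustration: the multipliers in the toy index below are made up, and the names mult, rows, and cols are hypothetical.

```r
# Toy observations (amounts from the question)
IndexTable <- data.frame(
  amount = c(31.88, 29.21, 28.92),
  Year   = c(1987, 1985, 1986),
  Month  = c("Jan", "Mar", "Jul"),
  stringsAsFactors = FALSE
)

# Toy index table with made-up multipliers and only three month columns
index <- data.frame(
  year = c(1985, 1986, 1987),
  Jan  = c(2.0, 1.9, 1.8),
  Mar  = c(2.1, 1.95, 1.85),
  Jul  = c(2.2, 2.0, 1.9)
)

# Month columns as a numeric matrix, so one (row, column) pair picks one value
mult <- as.matrix(index[, -1])
rows <- match(IndexTable$Year, index$year)
cols <- match(as.character(IndexTable$Month), colnames(mult))

# A two-column index matrix selects one element per observation
IndexTable$adjusted_amount <- IndexTable$amount * mult[cbind(rows, cols)]
```

With the real index table, the average column would have to be dropped as well before building the matrix.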

How to complete missing values with Na in a list?

I have a data frame that has the following columns: Tree ID, month, value. For some months there is no recorded data, so those months do not exist in the data frame. I have completed the list with the missing months, but now I do not know how to insert NA in the value column for the added months.
Example:
Tree.Id: 1, 2, 3, 4, 5, 6, 7, 8, 9, 10
Month: Jan, Feb, Mar, May, Jun, Jul, Sept, Oct, Nov, Dec
Values: 1,0,1,1,0,2,1,1,0,2
The following months are missing: Apr, Aug.
I added them with the code below, and now I want to insert NA in the value column for those 2 added months.
Here is what I tried:
tree_ls <- list()
for (i in unique(data$Tree.ID)){
mon1 <- data$month[data$Tree.ID == i] ### extract the month for every Tree iD
mon <- min(mon1, na.rm=T):max(mon1, na.rm=T) # completes the numbers with the missing month
dat1 <- data$value[data$Tree.ID == i]
......
After this step, I do not know how to create a list that will add NA for all the added months that were missing, so I will have lists of the same length.
Thanks
This is an old post, but I have a pretty good solution for this:
To begin, your small reproducible code should probably be the following:
id <- 1:10
month <- c("Jan", "Feb", "Mar", "May", "Jun", "Jul", "Sept", "Oct", "Nov", "Dec")
value <- c(1,0,1,1,0,2,1,1,0,2)
df <- data.frame(id=id, month=month,value=value)
> head(df)
id month value
1 1 Jan 1
2 2 Feb 0
3 3 Mar 1
4 4 May 1
5 5 Jun 0
6 6 Jul 2
Now build a data frame that holds the full domain, i.e. every month, including the ones that should show up as NA where data is missing.
completeMonths <- c("Jan", "Feb", "Mar", "Apr","May", "Jun", "Jul","Aug", "Sept", "Oct", "Nov", "Dec")
df2 <- data.frame(month=completeMonths)
> df2
month
1 Jan
2 Feb
3 Mar
4 Apr
5 May
6 Jun
7 Jul
8 Aug
9 Sept
10 Oct
11 Nov
12 Dec
Now we have a column with all the underlying values, so when we merge, we can fill the missing rows as NA with the following syntax:
merge(df, df2, by="month", all=TRUE)
With our results as follows:
month id value
1 Dec 10 2
2 Feb 2 0
3 Jan 1 1
4 Jul 6 2
5 Jun 5 0
6 Mar 3 1
7 May 4 1
8 Nov 9 0
9 Oct 8 1
10 Sept 7 1
11 Apr NA NA
12 Aug NA NA
Hope this helps, data wrangling sucks.
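The same merge idea can be applied per tree, which is closer to the loop in the question. The sketch below assumes that month is stored as a number, as the min(mon1):max(mon1) step in the question suggests, and uses a small made-up data frame; the names grid and full are illustrative.

```r
# Made-up data: tree 1 lacks month 3, tree 2 lacks month 4
data <- data.frame(
  Tree.ID = c(1, 1, 1, 2, 2),
  month   = c(1, 2, 4, 3, 5),
  value   = c(1, 0, 1, 2, 1)
)

# For each tree, build the full month range and left-join the recorded values;
# months with no record get NA in the value column
full <- do.call(rbind, lapply(split(data, data$Tree.ID), function(d) {
  grid <- data.frame(Tree.ID = d$Tree.ID[1],
                     month   = min(d$month):max(d$month))
  merge(grid, d, by = c("Tree.ID", "month"), all.x = TRUE)
}))
```

Each tree is completed only between its own first and last recorded month; to get lists of equal length for all trees, the grid could use a common range such as 1:12 instead.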
When you say that you have a data frame with some months that have "no recorded data" and therefore "do not exist", the fact that they are in the data frame at all means they have some representation. I'm going to guess that by "do not exist" you mean that they are blank strings, such as "". If that's the case, you can replace the blank strings with NA values using mutate in the dplyr package and ifelse in the base package as follows:
library(dplyr);
data_with_nas <- mutate(data, value = ifelse(value=="", NA, value));
That reads as "change the data data frame such that its value cells are replaced with NA if they were a blank string, or kept as is otherwise."

import txt file with desired data structure in R

The txt is like
#---*----1----*----2----*---
Name Time.Period Value
A Jan 2013 10
B Jan 2013 11
C Jan 2013 12
A Feb 2013 9
B Feb 2013 11
C Feb 2013 15
A Mar 2013 10
B Mar 2013 8
C Mar 2013 13
I tried to use read.table with readLines and count.fields as shown below:
> path <- list.files()
> data <- read.table(text=readLines(path)[count.fields(path, blank.lines.skip=FALSE) == 4])
Warning message:
In readLines(path) : incomplete final line found on 'data1.txt'
> data
V1 V2 V3 V4
1 A Jan 2013 10
2 B Jan 2013 11
3 C Jan 2013 12
4 A Feb 2013 9
5 B Feb 2013 11
6 C Feb 2013 15
7 A Mar 2013 10
8 B Mar 2013 8
9 C Mar 2013 13
The problem is that it gives four columns instead of three. I therefore manipulate my data as below, but I am looking for an alternative.
> library(zoo)
> data$Name <- as.character(data$V1)
> data$Time.Period <- as.yearmon(paste(data$V2, data$V3, sep=" "))
> data$Value <- as.numeric(data$V4)
> DATA <- data[, 5:7]
> DATA
Name Time.Period Value
1 A Jan 2013 10
2 B Jan 2013 11
3 C Jan 2013 12
4 A Feb 2013 9
5 B Feb 2013 11
6 C Feb 2013 15
7 A Mar 2013 10
8 B Mar 2013 8
9 C Mar 2013 13
You can use read.fwf to read fixed width files. You need to correctly specify the width of each column, in spaces.
data <- read.fwf(path, widths=c(-12, 8, -4, 2), header=T)
The key there is how you specify the width. Negative means skip that many places, positive means read that many. I am assuming entries in the last column have only 2 digits. Change widths accordingly if this is not the case. You will probably also have to fix the column names.
You will have to change the indices if the file format changes, or come up with some clever regexp to read it from the first few rows. A better solution would be to enclose your strings in " or, even better, avoid the format altogether.
?count.fields
As the R documentation states, count.fields counts the number of fields, as separated by sep, in each line of the file read. When you filter with count.fields(path, blank.lines.skip=FALSE) == 4, the header row is dropped, because it actually has only three fields.
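If the column widths are not reliable, another option is to split each data line with a regular expression that treats the month and the four-digit year as one field. This is a sketch under that assumption about the line layout; the inline lines vector stands in for readLines(path), and the pattern and the name parts are illustrative.

```r
# Stand-in for readLines(path): header plus two data lines
lines <- c("Name Time.Period Value",
           "A Jan 2013 10",
           "B Jan 2013 11")
body <- lines[-1]  # drop the header row

# Three groups: name, "month year" (kept together), value
pattern <- "^([^[:space:]]+)[[:space:]]+([^[:space:]]+[[:space:]]+[0-9]{4})[[:space:]]+([^[:space:]]+)$"
parts <- regmatches(body, regexec(pattern, body))

# Element 1 of each match is the whole line; elements 2 to 4 are the groups
DATA <- data.frame(
  Name        = vapply(parts, `[`, character(1), 2, USE.NAMES = FALSE),
  Time.Period = vapply(parts, `[`, character(1), 3, USE.NAMES = FALSE),
  Value       = as.numeric(vapply(parts, `[`, character(1), 4, USE.NAMES = FALSE)),
  stringsAsFactors = FALSE
)
```

Time.Period is left as a character string here; it can be converted with zoo::as.yearmon as in the question.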
