Manipulating csv spreadsheet using R - r

I want to write a basic loop that looks like this:
Import spreadsheet as data frame
Scan each variable in the header for a missing data point ("NA"), then remove all of that variable's data for the calendar month containing it. For example:
Here variable 'X' has an 'NA' in the second January row, so I want to remove all January values of 'X'.
X Y Z
jan 3 3 3
jan NA 4 5
jan 2 6 2
feb 1 8 NA
feb 4 2 3
feb 9 4 1
mar 5 NA 5
mar 8 7 4
mar 9 7 5
Then create new data frames that look like:
X
feb 1
feb 4
feb 9
mar 5
mar 8
mar 9
Y
jan 3
jan 4
jan 6
feb 8
feb 2
feb 4
Z
jan 3
jan 5
jan 2
mar 5
mar 4
mar 5
Finally, save the remaining 'complete months' (in this case 'X' feb-mar, 'Y' jan-feb, 'Z' jan and mar) in a new data frame to export as a new .csv file.
Any help getting started would be hugely appreciated. If this has already been asked, please direct me to the source; I wasn't sure exactly how to search for this.

Try:
ddf2 = ddf[,c(1,2)]
# %in% (rather than !=) also works when more than one month contains an NA
xdf = ddf[!(ddf$month %in% ddf2$month[is.na(ddf2$X)]), c(1,2)]
xdf
month X
4 feb 1
5 feb 4
6 feb 9
7 mar 5
8 mar 8
9 mar 9
ddf2 = ddf[,c(1,3)]
ydf = ddf[!(ddf$month %in% ddf2$month[is.na(ddf2[,2])]), c(1,3)]
ydf
month Y
1 jan 3
2 jan 4
3 jan 6
4 feb 8
5 feb 2
6 feb 4
ddf2 = ddf[,c(1,4)]
zdf = ddf[!(ddf$month %in% ddf2$month[is.na(ddf2[,2])]), c(1,4)]
zdf
month Z
1 jan 3
2 jan 5
3 jan 2
7 mar 5
8 mar 4
9 mar 5
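The three near-identical blocks above can be folded into one loop over the variable columns. The following is a minimal sketch rather than the answerer's code: it rebuilds the sample data from the question, assumes the month column is named `month` as in the answer, and writes to hypothetical file names in a temporary directory.

```r
# Sample data from the question; the column name `month` follows the answer above.
ddf <- data.frame(
  month = rep(c("jan", "feb", "mar"), each = 3),
  X = c(3, NA, 2, 1, 4, 9, 5, 8, 9),
  Y = c(3, 4, 6, 8, 2, 4, NA, 7, 7),
  Z = c(3, 5, 2, NA, 3, 1, 5, 4, 5),
  stringsAsFactors = FALSE
)

# Every column except `month` is treated as a variable to clean.
vars <- setdiff(names(ddf), "month")

# For each variable, find the months containing an NA and drop those months.
complete_months <- lapply(setNames(vars, vars), function(v) {
  bad <- unique(ddf$month[is.na(ddf[[v]])])
  ddf[!(ddf$month %in% bad), c("month", v)]
})

# Export one CSV per variable (file names are hypothetical).
out_dir <- tempdir()
for (v in vars) {
  write.csv(complete_months[[v]],
            file.path(out_dir, paste0(v, "_complete_months.csv")),
            row.names = FALSE)
}
```

`complete_months$X`, `complete_months$Y` and `complete_months$Z` then hold the same rows as `xdf`, `ydf` and `zdf`.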

Related

Repeating annual values multiple times to form a monthly dataframe

I have an annual dataset as below:
year <- c(2016,2017,2018)
xxx <- c(1,2,3)
yyy <- c(4,5,6)
df <- data.frame(year,xxx,yyy)
print(df)
year xxx yyy
1 2016 1 4
2 2017 2 5
3 2018 3 6
Where the values in column xxx and yyy correspond to values for that year.
I would like to expand this data frame (or create a new one) so that it retains the same column names but repeats each row 12 times (once per month of that year), including the year value in the first column.
As mocked up by the code below:
year <- rep(2016:2018,each=12)
xxx <- rep(1:3,each=12)
yyy <- rep(4:6,each=12)
df2 <- data.frame(year,xxx,yyy)
print(df2)
year xxx yyy
1 2016 1 4
2 2016 1 4
3 2016 1 4
4 2016 1 4
5 2016 1 4
6 2016 1 4
7 2016 1 4
8 2016 1 4
9 2016 1 4
10 2016 1 4
11 2016 1 4
12 2016 1 4
13 2017 2 5
14 2017 2 5
15 2017 2 5
16 2017 2 5
17 2017 2 5
18 2017 2 5
19 2017 2 5
20 2017 2 5
21 2017 2 5
22 2017 2 5
23 2017 2 5
24 2017 2 5
25 2018 3 6
26 2018 3 6
27 2018 3 6
28 2018 3 6
29 2018 3 6
30 2018 3 6
31 2018 3 6
32 2018 3 6
33 2018 3 6
34 2018 3 6
35 2018 3 6
36 2018 3 6
Any help would be greatly appreciated!
I'm new to R and I can see how I would do this with a loop statement but was wondering if there was an easier solution.
Convert df to a matrix, take the Kronecker product with a vector of 12 ones, and then convert back to a data.frame. The Kronecker product drops the column names, so setNames restores them. The as.data.frame can be omitted if a matrix result is ok.
setNames(as.data.frame(as.matrix(df) %x% rep(1, 12)), names(df))
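A different technique from the Kronecker product, shown here only as a sketch, is to repeat the row indices. It keeps the column names and column types without a round trip through a matrix. The name `df2` follows the question's mock-up.

```r
df <- data.frame(year = c(2016, 2017, 2018),
                 xxx  = c(1, 2, 3),
                 yyy  = c(4, 5, 6))

# Repeat each row index 12 times, then subset the data frame by those indices.
df2 <- df[rep(seq_len(nrow(df)), each = 12), ]

# Subsetting leaves row names like "1.1", "1.2"; reset them to 1..36.
rownames(df2) <- NULL
```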

replace NA with previous 2 years values

I have two data frames. df1 has NA values in p1 that need to be replaced with the mean of Average_f1 from df2 over the previous two years for the same bin.
For example, row 5 of df1 has year 2015 and bin 5, so the NA should be replaced with the mean of the 2013 and 2014 values for bin 5 in df2. For row 7, only one of the previous two years has a value.
df1
year p1 bin
2013 20 1
2013 24 1
2014 10 2
2014 11 2
2015 NA 5
2016 10 3
2017 NA 7
df2
year bin_p1 Average_f1
2013 5 29.5
2014 5 16.5
2015 NA 30
2016 7 12
output
df1
year p1 bin
2013 20 1
2013 24 1
2014 10 2
2014 11 2
2015 **23** 5
2016 10 3
2017 **12** 7
Thanks in advance
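One possible reading of the request, sketched in base R: for each NA in p1, look up the df2 rows from the two preceding years whose bin_p1 matches the row's bin, and take the mean of their Average_f1. The data frames below are rebuilt from the tables in the question; treating "previous 2 years" as year - 1 and year - 2 is an assumption.

```r
df1 <- data.frame(
  year = c(2013, 2013, 2014, 2014, 2015, 2016, 2017),
  p1   = c(20, 24, 10, 11, NA, 10, NA),
  bin  = c(1, 1, 2, 2, 5, 3, 7)
)
df2 <- data.frame(
  year       = c(2013, 2014, 2015, 2016),
  bin_p1     = c(5, 5, NA, 7),
  Average_f1 = c(29.5, 16.5, 30, 12)
)

for (i in which(is.na(df1$p1))) {
  # Rows of df2 from the two previous years with the same bin.
  # %in% treats an NA bin_p1 as "no match" instead of propagating NA.
  prev <- df2$year %in% (df1$year[i] - 1:2) & df2$bin_p1 %in% df1$bin[i]
  # Leave the NA in place when no matching year exists.
  if (any(prev)) df1$p1[i] <- mean(df2$Average_f1[prev])
}
```

With the sample data this fills row 5 with 23 (the mean of 29.5 and 16.5) and row 7 with 12, matching the desired output.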

Search in a column based on the value of a different column

I have a simple table with three columns ("Year", "Target", "Value") and I would like to create a new column (Resp) containing the "Year" in which "Value" first exceeds the row's "Target", looking only at later rows.
This is part of the table:
db <- data.frame(Year=2010:2017, Target=c(3,5,2,7,5,8,3,6), Value=c(4,5,2,7,4,9,5,8))
print(db)
Year Target Value
1 2010 3 4
2 2011 5 5
3 2012 2 2
4 2013 7 3
5 2014 5 4
6 2015 8 9
7 2016 3 5
8 2017 6 8
The intended result is:
Year Target Value Resp
1 2010 3 4 2011
2 2011 5 5 2015
3 2012 2 2 2013
4 2013 7 3 2015
5 2014 5 4 2015
6 2015 8 9 NA
7 2016 3 5 2017
8 2017 6 8 NA
Any suggestions on how I can solve this problem?
In addition to the 'Resp' column, I want to create a new one (Black.Y) containing the "Year" corresponding to the minimum of "Value" until 'Value' is higher than "Target".
The intended result is:
Year Target Value Resp Black.Y
1 2010 3 4 2011 NA
2 2011 5 5 2015 2012
3 2012 2 2 2013 NA
4 2013 7 3 2015 2014
5 2014 5 4 2015 NA
6 2015 8 9 NA 2016
7 2016 3 5 2017 NA
8 2017 6 8 NA NA
Any suggestions on how I can solve this problem?
Here's an approach in base R:
o <- outer(db$Target, db$Value, `<`) # compute a logical matrix
o[lower.tri(o, diag = TRUE)] <- FALSE # replace lower.tri and diag with FALSE
idx <- max.col(o, ties.method = "first") # get the index of the first maximum
idx <- replace(idx, rowSums(o) == 0, NA) # take care of cases without greater Value
db$Resp <- db$Year[idx] # add new column
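The resulting table is shown below. It uses the data.frame code from the question, where Value for 2013 is 7 (the printed table in the question shows 3), which is why row 2 gives 2013 rather than 2015: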
# Year Target Value Resp
# 1 2010 3 4 2011
# 2 2011 5 5 2013
# 3 2012 2 2 2013
# 4 2013 7 7 2015
# 5 2014 5 4 2015
# 6 2015 8 9 NA
# 7 2016 3 5 2017
# 8 2017 6 8 NA
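For readers who find the matrix version dense, the same Resp column can be computed row by row. This is only an illustrative sketch using the data.frame code from the question, not a replacement for the approach above; it is slower on large tables but spells out the rule directly.

```r
db <- data.frame(Year   = 2010:2017,
                 Target = c(3, 5, 2, 7, 5, 8, 3, 6),
                 Value  = c(4, 5, 2, 7, 4, 9, 5, 8))

n <- nrow(db)
db$Resp <- sapply(seq_len(n), function(i) {
  # Only rows after row i are candidates.
  later <- seq_len(n) > i
  # Positions of later rows whose Value exceeds this row's Target.
  hit <- which(later & db$Value > db$Target[i])
  # First such year, or NA when there is none.
  if (length(hit)) db$Year[hit[1]] else NA_integer_
})
```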

How do I keep strings in first column with tidyr::gather?

This may be a very basic question about tidyr, which I just started learning, but I don't seem to find an answer after much searching in SO and Google.
Suppose I have a data frame:
mydf<- data.frame(name=c("Joe","Mary","Bob"),
jan=1:3,
feb=4:6,
mar=7:9,
apr=10:12)
which I want to reshape from wide to long. Before, I used melt, so:
library(reshape)
melt(mydf,id.vars = "name",measure.vars = colnames(mydf)[-1])
Which produces
name variable value
1 Joe jan 1
2 Mary jan 2
3 Bob jan 3
4 Joe feb 4
5 Mary feb 5
6 Bob feb 6
7 Joe mar 7
8 Mary mar 8
9 Bob mar 9
10 Joe apr 10
11 Mary apr 11
12 Bob apr 12
I wanted to use tidyr::gather, so I tried
gather(mydf,month,sales,jan:apr)
Which produces
name month sales
1 2 jan 1
2 3 jan 2
3 1 jan 3
4 2 feb 4
5 3 feb 5
6 1 feb 6
7 2 mar 7
8 3 mar 8
9 1 mar 9
10 2 apr 10
11 3 apr 11
12 1 apr 12
I'm lost here, as I haven't been able to keep the names in the first column.
What am I missing here?
######### EDIT TO ADD #######
> R.Version()$version.string
[1] "R version 3.2.2 (2015-08-14)"
> packageVersion("tidyr")
[1] ‘0.3.0’
It looks like in tidyr 0.3.0 you will need to convert the factor column 'name' to character. I'm not sure why that has changed from version 0.2.0, where it worked without conversion to character. Nevertheless, here we go ...
gather(transform(mydf, name = as.character(name)), month, sales, jan:apr)
# name month sales
# 1 Joe jan 1
# 2 Mary jan 2
# 3 Bob jan 3
# 4 Joe feb 4
# 5 Mary feb 5
# 6 Bob feb 6
# 7 Joe mar 7
# 8 Mary mar 8
# 9 Bob mar 9
# 10 Joe apr 10
# 11 Mary apr 11
# 12 Bob apr 12
R.version.string
# [1] "R version 3.2.2 (2015-08-14)"
packageVersion("tidyr")
# [1] ‘0.3.0’
Credit to @aosmith for finding the closed GitHub issue. You should be able to use the development version without issue now. To install the dev version, use
devtools::install_github(
"hadley/tidyr",
ref = "2e08772d154babcc97912bcae8b0b64b65b964ab"
)
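As a package-free cross-check of the expected long format, the same reshape can be sketched in base R. This is not how tidyr implements gather; it simply builds the three columns by hand, with stringsAsFactors = FALSE so that name stays a character column on any R version. The name `long` is made up for this example.

```r
mydf <- data.frame(name = c("Joe", "Mary", "Bob"),
                   jan = 1:3, feb = 4:6, mar = 7:9, apr = 10:12,
                   stringsAsFactors = FALSE)

month_cols <- names(mydf)[-1]

long <- data.frame(
  # Names cycle once per month column.
  name  = rep(mydf$name, times = length(month_cols)),
  # Each month label is repeated once per person.
  month = rep(month_cols, each = nrow(mydf)),
  # Month columns stacked end to end: jan, then feb, and so on.
  sales = unlist(mydf[month_cols], use.names = FALSE),
  stringsAsFactors = FALSE
)
```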

Merge 2 resulting vectors into 1 data frame using R

I have a df like this
Month <- c('JAN','JAN','JAN','JAN','FEB','FEB','MAR','APR','MAY','MAY')
Category <- c('A','A','B','C','A','E','B','D','E','F')
Year <- c(2014,2015,2015,2015,2014,2013,2015,2014,2015,2013)
Number_Combinations <- c(3,2,3,4,1,3,6,5,1,1)
df <- data.frame(Month ,Category,Year,Number_Combinations)
df
Month Category Year Number_Combinations
1 JAN A 2014 3
2 JAN A 2015 2
3 JAN B 2015 3
4 JAN C 2015 4
5 FEB A 2014 1
6 FEB E 2013 3
7 MAR B 2015 6
8 APR D 2014 5
9 MAY E 2015 1
10 MAY F 2013 1
I have another df that I got from the above dataframe with a condition
df1 <- subset(df,Number_Combinations > 2)
df1
Month Category Year Number_Combinations
1 JAN A 2014 3
3 JAN B 2015 3
4 JAN C 2015 4
6 FEB E 2013 3
7 MAR B 2015 6
8 APR D 2014 5
Now I want to create a table reporting the month, the total number of rows for the month in df, and the total number of rows for the month in df1.
Desired Output would be
Month Number_Month_df Number_Month_df1
1 JAN 4 3
2 FEB 2 1
3 MAR 1 1
4 APR 1 1
5 MAY 2 0
I used table(df) and table(df1) and tried merging them, but I am not getting the desired result. Could someone please help me get the above data frame?
We get the table of the 'Month' column from both 'df' and 'df1', convert to 'data.frame' (as.data.frame), merge by the 'Var1', and change the column names accordingly.
res <- merge(as.data.frame(table(df$Month)),
as.data.frame(table(df1$Month)), by='Var1')
colnames(res) <- c('Month', 'Number_Month_df', 'Number_Month_df1')
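Alternatively, skip the merge and tabulate both data frames against the same month levels, which keeps the rows aligned and keeps months that have no rows in df1:
lev <- unique(as.character(df$Month))
res <- data.frame(Month = lev,
Number_Month_df = as.vector(table(factor(df$Month, levels = lev))),
Number_Month_df1 = as.vector(table(factor(df1$Month, levels = lev))))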
