transposing rows into multiple columns in R - r

I have the following data frame df:
ID Year Var value
1 2011 x1 1.2
1 2011 x2 2
1 2012 x1 1.5
1 2012 x2 2.3
3 2013 x1 3
3 2014 x1 4
4 2015 x1 5
5 2016 x1 6
4 2016 x1 2
I want to transform the data into the following format:
ID Year x1 x2
1 2011 1.2 2
1 2012 1.5 2.3
3 2013 3 NA
3 2014 4 NA
4 2015 5 NA
4 2016 2 NA
5 2016 6 NA
Please help

Using the tidyr library, I believe this is what you are looking for:
library(tidyr)
df <- data.frame(stringsAsFactors=FALSE,
  ID = c(1L, 1L, 1L, 1L, 3L, 3L, 4L, 5L, 4L),
  Year = c(2011L, 2011L, 2012L, 2012L, 2013L, 2014L, 2015L, 2016L, 2016L),
  Var = c("x1", "x2", "x1", "x2", "x1", "x1", "x1", "x1", "x1"),
  value = c(1.2, 2, 1.5, 2.3, 3, 4, 5, 6, 2)
)
df2 <- df %>%
  spread(Var, value)
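In tidyr 1.0.0 and later, spread() is superseded by pivot_wider(). A minimal sketch of the same reshape, assuming a recent tidyr is installed (the name df_wide is only illustrative):

```r
library(tidyr)

df <- data.frame(
  ID    = c(1L, 1L, 1L, 1L, 3L, 3L, 4L, 5L, 4L),
  Year  = c(2011L, 2011L, 2012L, 2012L, 2013L, 2014L, 2015L, 2016L, 2016L),
  Var   = c("x1", "x2", "x1", "x2", "x1", "x1", "x1", "x1", "x1"),
  value = c(1.2, 2, 1.5, 2.3, 3, 4, 5, 6, 2),
  stringsAsFactors = FALSE
)

# One row per ID/Year pair; each distinct value of Var becomes a column.
# ID/Year pairs without an x2 measurement get NA in the x2 column.
df_wide <- pivot_wider(df, names_from = Var, values_from = value)
print(df_wide)
```

Rows come out in order of first appearance of each ID/Year pair, so sort afterwards if a particular order matters.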

Related

In R: How to tell R that a value should be inserted into a categorical column while applying two conditions

What we have:
companyID year status
1 2010
1 2011
1 2012 2
1 2013
1 2014
2 2007
2 2008
2 2009 2
2 2010
2 2011
2 2012 1
2 2013
For companyID 1: I have an observation with status 2 in year 2012. I want R to set every observation before that to status 1 (by companyID), and every observation after it (after the status 2 in 2012) to status 2 (still per company).
For companyID 2: I have an observation with status 2 in year 2009. I want R to set every observation before that to status 1 (by companyID), and then to set the following observations to status 2 until a status 1 shows up again (still per company).
(Summing up: fill in the other value (1) before the one that is already there (2), then continue with 2 until there is another change, where a change is either a new company or a status change already recorded in the original data frame.)
This would then look like the following, which is what we want to achieve:
companyID year status
1 2010 1
1 2011 1
1 2012 2
1 2013 2
1 2014 2
2 2007 1
2 2008 1
2 2009 2
2 2010 2
2 2011 2
2 2012 1
2 2013 1
We have a large dataset, so doing this manually is not possible. Is there a way to code this for both companyIDs simultaneously (and hence for all the thousands of observations we have) in R?
Here is one way:
library(dplyr)
library(tidyr)
df %>%
  group_by(companyID) %>%
  fill(status) %>%
  mutate(status = replace(status, is.na(status),
                          ifelse(na.omit(status)[1] == 1, 2, 1))) %>%
  ungroup
# companyID year status
# <int> <int> <dbl>
# 1 1 2010 1
# 2 1 2011 1
# 3 1 2012 2
# 4 1 2013 2
# 5 1 2014 2
# 6 2 2007 1
# 7 2 2008 1
# 8 2 2009 2
# 9 2 2010 2
#10 2 2011 2
#11 2 2012 1
#12 2 2013 1
data
df <- structure(list(companyID = c(1L, 1L, 1L, 1L, 1L, 2L, 2L, 2L,
2L, 2L, 2L, 2L), year = c(2010L, 2011L, 2012L, 2013L, 2014L,
2007L, 2008L, 2009L, 2010L, 2011L, 2012L, 2013L), status = c(NA,
NA, 2L, NA, NA, NA, NA, 2L, NA, NA, 1L, NA)),
class = "data.frame", row.names = c(NA, -12L))
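To see why the pipeline above works, it can help to run the two steps separately. The sketch below rebuilds the same data and uses the illustrative names step1 and step2, which are not part of the original answer:

```r
library(dplyr)
library(tidyr)

df <- data.frame(
  companyID = c(1L, 1L, 1L, 1L, 1L, 2L, 2L, 2L, 2L, 2L, 2L, 2L),
  year      = c(2010:2014, 2007:2013),
  status    = c(NA, NA, 2L, NA, NA, NA, NA, 2L, NA, NA, 1L, NA)
)

# Step 1: carry the last known status forward within each company.
# Rows before the first known status have nothing to carry, so they stay NA.
step1 <- df %>%
  group_by(companyID) %>%
  fill(status) %>%
  ungroup()
print(step1$status)

# Step 2: the remaining NAs are the leading rows of each company.
# Give them the opposite of the company's first known status.
step2 <- step1 %>%
  group_by(companyID) %>%
  mutate(status = replace(status, is.na(status),
                          ifelse(na.omit(status)[1] == 1, 2, 1))) %>%
  ungroup()
print(step2$status)
```

This relies on the rows already being sorted by year within each company, as they are in the sample data.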

Changing value of row based on condition in other condition

My dataframe looks like this:
Index Year Renovation
1 2012 1
1 2018 1
2 2012 1
2 2018 1
3 2012 0
3 2018 0
I would like to change the Renovation variable for 2012 to '0', IF the renovation variable for 2018 was "1". So I am facing a double condition here. How can I do this in R?
You can use ifelse to check for condition.
library(dplyr)
df %>%
  group_by(Index) %>%
  mutate(Renovation = ifelse(Year == 2012 &
                             Renovation[match(2018, Year)] == 1, 0, Renovation))
# Index Year Renovation
# <int> <int> <dbl>
#1 1 2012 0
#2 1 2018 1
#3 2 2012 0
#4 2 2018 1
#5 3 2012 0
#6 3 2018 0
data
df <- structure(list(Index = c(1L, 1L, 2L, 2L, 3L, 3L), Year = c(2012L,
2018L, 2012L, 2018L, 2012L, 2018L), Renovation = c(1L, 1L, 1L,
1L, 0L, 0L)), class = "data.frame", row.names = c(NA, -6L))
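One thing to watch with the match() approach: if an Index has no 2018 row, match(2018, Year) returns NA and the 2012 value for that group becomes NA. A possible variant, assuming such groups should simply be left unchanged, uses any() instead. The sample data below is modified for illustration so that Index 2 has no 2018 row:

```r
library(dplyr)

# Modified sample data (an assumption for this sketch): Index 2 lacks a 2018 row.
df <- data.frame(
  Index      = c(1L, 1L, 2L, 3L, 3L),
  Year       = c(2012L, 2018L, 2012L, 2012L, 2018L),
  Renovation = c(1L, 1L, 1L, 0L, 0L)
)

res <- df %>%
  group_by(Index) %>%
  # any() over an empty selection is FALSE, so groups without a 2018 row
  # keep their original Renovation value instead of turning into NA.
  mutate(Renovation = ifelse(Year == 2012 & any(Renovation[Year == 2018] == 1),
                             0, Renovation)) %>%
  ungroup()
print(res)
```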

R: replacing values in the randomly selected fractions of observations

I am relatively new to R, and the solution to this problem is probably rather simple.
Let's imagine that I have a nest dataset of two bird species (a and b) like this:
df
year nestid sp egg chick
2013 a1 a 2 1
2013 a2 a NA 1
2013 a3 a NA 0
2013 a4 a NA 1
2013 a5 a NA 0
2013 b1 b 2 0
2013 b2 b NA 1
2013 b3 b NA 2
2013 b4 b NA 1
2014 a1 a NA 1
2014 a2 a NA 1
2014 a3 a 1 1
2014 a4 a NA 1
2014 a5 a NA 1
2014 b1 b NA 1
2014 b2 b NA 2
2014 b3 b NA 2
2014 b4 b NA 1
I want to infer the number of eggs for those NAs from the number of chicks. It makes sense to replace NA with 2 if there were 2 chicks, as they lay 2 eggs at most.
For nests with 1 chick, however, I want to replace NA with 2 for a randomly selected 80% of the nests and with 1 for the remaining 20%, for species "a" in 2013. For species "a" in 2014, the ratio would be 40% and 60% for clutch sizes of 2 and 1 respectively.
I tried the following but could not work out how to code it properly:
df%>% mutate(egg=ifelse(egg==0 & chick==2, 2, egg))
df%>%
mutate(egg=ifelse(egg==0 & chick==1 & year==2013, sample_frac(.8)==2, egg))
Any help would be greatly appreciated!
Many thanks
One approach could be:
set.seed(123)
#missing egg & chick = 2
df$egg <- with(df,ifelse(is.na(egg) & chick == 2, 2, egg))
#2013 data having species = 'a', missing egg & chick = 1
x <- with(df, which(is.na(egg) & chick == 1 & sp == 'a' & year == 2013))
x_sample <- sample(x, round(0.8 * length(x)))
df$egg[x_sample] <- 2
df$egg[setdiff(x, x_sample)] <- 1
#2014 data having species = 'a', missing egg & chick = 1
x <- with(df, which(is.na(egg) & chick == 1 & sp == 'a' & year == 2014))
x_sample <- sample(x, round(0.4 * length(x)))
df$egg[x_sample] <- 2
df$egg[setdiff(x, x_sample)] <- 1
which gives
> df
year nestid sp egg chick
1 2013 a1 a 2 1
2 2013 a2 a 2 1
3 2013 a3 a NA 0
4 2013 a4 a 2 1
5 2013 a5 a NA 0
6 2013 b1 b 2 0
7 2013 b2 b NA 1
8 2013 b3 b 2 2
9 2013 b4 b NA 1
10 2014 a1 a 1 1
11 2014 a2 a 2 1
12 2014 a3 a 1 1
13 2014 a4 a 2 1
14 2014 a5 a 1 1
15 2014 b1 b NA 1
16 2014 b2 b 2 2
17 2014 b3 b 2 2
18 2014 b4 b NA 1
Sample data:
df <- structure(list(year = c(2013L, 2013L, 2013L, 2013L, 2013L, 2013L,
2013L, 2013L, 2013L, 2014L, 2014L, 2014L, 2014L, 2014L, 2014L,
2014L, 2014L, 2014L), nestid = c("a1", "a2", "a3", "a4", "a5",
"b1", "b2", "b3", "b4", "a1", "a2", "a3", "a4", "a5", "b1", "b2",
"b3", "b4"), sp = c("a", "a", "a", "a", "a", "b", "b", "b", "b",
"a", "a", "a", "a", "a", "b", "b", "b", "b"), egg = c(2L, NA,
NA, NA, NA, 2L, NA, NA, NA, NA, NA, 1L, NA, NA, NA, NA, NA, NA
), chick = c(1L, 1L, 0L, 1L, 0L, 0L, 1L, 2L, 1L, 1L, 1L, 1L,
1L, 1L, 1L, 2L, 2L, 1L)), .Names = c("year", "nestid", "sp",
"egg", "chick"), class = "data.frame", row.names = c(NA, -18L
))
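Since the 2013 and 2014 steps differ only in the year and the fraction, they could be folded into a small helper. The sketch below is one possible way to do that; assign_clutch and the rates vector are hypothetical names introduced here. It samples positions with sample.int() rather than calling sample(x, ...) directly, because sample(x) treats a single number x as 1:x, which would misbehave if only one nest matched.

```r
set.seed(123)

df <- data.frame(
  year  = rep(c(2013L, 2014L), each = 9),
  sp    = rep(rep(c("a", "b"), times = c(5, 4)), times = 2),
  egg   = c(2L, NA, NA, NA, NA, 2L, NA, NA, NA, NA, NA, 1L, NA, NA, NA, NA, NA, NA),
  chick = c(1L, 1L, 0L, 1L, 0L, 0L, 1L, 2L, 1L, 1L, 1L, 1L, 1L, 1L, 1L, 2L, 2L, 1L)
)

# Hypothetical helper: among the rows in `idx`, give a random fraction
# `p_two` of them 2 eggs and the remaining rows 1 egg.
assign_clutch <- function(egg, idx, p_two) {
  n_two <- round(p_two * length(idx))
  two <- idx[sample.int(length(idx), n_two)]  # safe even when length(idx) is 0 or 1
  egg[idx] <- 1L
  egg[two] <- 2L
  egg
}

# Nests with 2 chicks must have had 2 eggs.
df$egg[is.na(df$egg) & df$chick == 2] <- 2L

# Fraction of one-chick nests of species "a" assumed to have had 2 eggs, by year.
rates <- c("2013" = 0.8, "2014" = 0.4)
for (yr in names(rates)) {
  idx <- which(is.na(df$egg) & df$chick == 1 & df$sp == "a" &
                 df$year == as.integer(yr))
  df$egg <- assign_clutch(df$egg, idx, rates[[yr]])
}
print(df)
```

Which individual nests receive 2 depends on the seed, but the counts per year are fixed by the rounding.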

Looping through and comparing values in 2 dataframes

I have 2 dataframes that I need to loop through.
Df1[1:5,]
year month Vol
1 2015 7 4.82e-05
2 2015 6 5.91e-05
3 2015 5 6.56e-05
4 2015 4 6.10e-05
5 2015 3 7.85e-05
Df2[1:5,]
year month IB
1 2015 7 0
2 2015 4 1
3 2015 3 0
4 2015 6 1
5 2015 5 0
I need to loop through DF1, compare the months from DF1 and DF2, and if they are the same then set DF1$IB<-DF2$IB. I tried using sapply, but I get this error
tmp<-sapply(DF1$month,function(x){if(DF2$month==x){
DF1$IB<-DF2$IB
}})
Warning messages:
1: In if (DF2$month == x) { :
the condition has length > 1 and only the first element will be used
.....
Any help would be greatly appreciated. Otherwise I would have to resort to multiple for loops, and since DF1 is 900K rows long and DF2 is 300 rows long, that seems very inefficient to me.
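With a recent version of data.table (the on argument was released in v1.9.6; at the time of writing it required the development version v1.9.5 from GitHub) you don't need to set keys. After library(data.table), setDT(df1)[df2, on = c("year","month")] adds the IB column, which gives: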
year month Vol IB
1: 2015 7 4.82e-05 0
2: 2015 4 6.10e-05 1
3: 2015 3 7.85e-05 0
4: 2015 6 5.91e-05 1
5: 2015 5 6.56e-05 0
Supposing that the year/month are not equal for both datasets, you have to join differently:
setDT(df2)[df1, on = c("year","month")]
which gives:
year month IB Vol
1: 2015 7 0 4.82e-05
2: 2015 6 1 5.91e-05
3: 2015 5 0 6.56e-05
4: 2015 4 1 6.10e-05
5: 2015 3 NA 7.85e-05
Used data for second example:
df1 <- structure(list(year = c(2015L, 2015L, 2015L, 2015L, 2015L), month = c(7L, 6L, 5L, 4L, 3L), Vol = c(4.82e-05, 5.91e-05, 6.56e-05, 6.1e-05, 7.85e-05)), .Names = c("year", "month", "Vol"), class = "data.frame", row.names = c("1", "2", "3", "4", "5"))
df2 <- structure(list(year = c(2015L, 2015L, 2015L, 2015L, 2015L), month = c(7L, 4L, 2L, 6L, 5L), IB = c(0L, 1L, 0L, 1L, 0L)), .Names = c("year", "month", "IB"), class = "data.frame", row.names = c("1", "2", "3", "4", "5"))
If your Df1 is that large, data.table might be a better choice than merge.
library(data.table)
setkey(setDT(Df1),year,month)[setDT(Df2),IB:=IB]
Df1
# year month Vol IB
# 1: 2015 3 7.85e-05 0
# 2: 2015 4 6.10e-05 1
# 3: 2015 5 6.56e-05 0
# 4: 2015 6 5.91e-05 1
# 5: 2015 7 4.82e-05 0
So this converts Df1 to a data.table and indexes it on year and month, then does a data.table join on Df2 (also converted to a data.table), and finally adds the IB column from Df2 to Df1.
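The same update can also be written without setting a key, by combining on= with := in one call. This is a sketch assuming data.table 1.9.6 or later; unlike setkey(), it leaves the row order of Df1 untouched:

```r
library(data.table)

Df1 <- data.table(year = 2015L, month = c(7L, 6L, 5L, 4L, 3L),
                  Vol = c(4.82e-05, 5.91e-05, 6.56e-05, 6.10e-05, 7.85e-05))
Df2 <- data.table(year = 2015L, month = c(7L, 4L, 3L, 6L, 5L),
                  IB = c(0L, 1L, 0L, 1L, 0L))

# Update join: for every row of Df1 whose year/month appears in Df2,
# copy Df2's IB (referred to as i.IB) into Df1 by reference.
# Rows of Df1 without a match would get NA.
Df1[Df2, on = c("year", "month"), IB := i.IB]
print(Df1)
```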
Using a more realistic example:
set.seed(1)
Df1 <- data.frame(year=rep(2015,1e6),
month=sample(3:7,1e6,replace=TRUE),
Vol=rnorm(1e6))
system.time(result.mrg <- merge(Df1,Df2,by=c("year","month")))
# user system elapsed
# 11.8 0.0 11.8
system.time(result.dt <- setkey(setDT(Df1),year,month)[setDT(Df2),IB:=IB])
# user system elapsed
# 0.07 0.00 0.06
identical(result.mrg$IB, result.dt$IB)
# [1] TRUE

Grouping data and then assigning values to variable names stored in strings - R

I am trying to migrate this activity from Excel/SQL to R and I am stuck. Any help is very much appreciated. Thanks!
Format of Data:
There are unique customer ids. Each customer has purchases in different groups in different years.
Objective:
For each customer ID, get one row of output. Use the names stored in the Variable_Name column to create columns, and assign each column the sum of Amount. Create a matching second set of columns holding 1 or 0 depending on the presence or absence of revenue.
SOURCE:
Cust_ID Group Year Variable_Name Amount
1 1 A 2009 A_2009 2000
2 1 B 2009 B_2009 100
3 2 B 2009 B_2009 300
4 2 C 2009 C_2009 20
5 3 D 2009 D_2009 299090
6 3 A 2011 A_2011 89778456
7 1 B 2011 B_2011 884
8 1 C 2010 C_2010 34894
9 3 D 2010 D_2010 389849
10 2 A 2013 A_2013 742
11 1 B 2013 B_2013 25661
12 2 C 2007 C_2007 393
13 3 D 2007 D_2007 23
OUTPUT:
Cust_ID A_2009 B_2009 C_2009 D_2009 A_2011 …. A_2009_P B_2009_P
1 sum of amount .. 1 0 ….
2
3
dput of original data:
structure(list(Cust_ID = c(1L, 1L, 2L, 2L, 3L, 3L, 1L, 1L, 3L,
2L, 1L, 2L, 3L), Group = c("A", "B", "B", "C", "D", "A", "B",
"C", "D", "A", "B", "C", "D"), Year = c(2009L, 2009L, 2009L,
2009L, 2009L, 2011L, 2011L, 2010L, 2010L, 2013L, 2013L, 2007L,
2007L), Variable_Name = c("A_2009", "B_2009", "B_2009", "C_2009",
"D_2009", "A_2011", "B_2011", "C_2010", "D_2010", "A_2013", "B_2013",
"C_2007", "D_2007"), Amount = c(2000L, 100L, 300L, 20L, 299090L,
89778456L, 884L, 34894L, 389849L, 742L, 25661L, 393L, 23L)), .Names = c("Cust_ID",
"Group", "Year", "Variable_Name", "Amount"), class = "data.frame", row.names = c(NA,
-13L))
One option:
intm <- as.data.frame.matrix(xtabs(Amount ~ Cust_ID + Variable_Name,data=dat))
result <- data.frame(aggregate(Amount~Cust_ID, data=dat,sum),intm,(intm > 0)+0 )
Result (abridged):
Cust_ID Amount A_2009 A_2011 ... A_2009.1 A_2011.1
1 1 63539 2000 0 ... 1 0
2 2 1455 0 0 ... 0 0
3 3 90467418 0 89778456 ... 0 1
If the names are a concern, they can easily be fixed via:
names(result) <- gsub("\\.1$","_P",names(result))
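For a self-contained run, the sketch below assigns the question's data to dat (only the columns the answer uses) and names the presence columns before combining, so no renaming is needed afterwards. The names sums and flags are illustrative, not from the original answer:

```r
dat <- data.frame(
  Cust_ID = c(1L, 1L, 2L, 2L, 3L, 3L, 1L, 1L, 3L, 2L, 1L, 2L, 3L),
  Variable_Name = c("A_2009", "B_2009", "B_2009", "C_2009", "D_2009", "A_2011",
                    "B_2011", "C_2010", "D_2010", "A_2013", "B_2013", "C_2007",
                    "D_2007"),
  Amount = c(2000L, 100L, 300L, 20L, 299090L, 89778456L, 884L, 34894L,
             389849L, 742L, 25661L, 393L, 23L),
  stringsAsFactors = FALSE
)

# Sum of Amount per customer (rows) and variable name (columns).
sums <- as.data.frame.matrix(xtabs(Amount ~ Cust_ID + Variable_Name, data = dat))

# 1 where the customer has revenue for that variable, 0 otherwise.
flags <- (sums > 0) + 0
colnames(flags) <- paste0(colnames(flags), "_P")

result <- data.frame(Cust_ID = as.integer(rownames(sums)), sums, flags)
print(result[, c("Cust_ID", "A_2009", "B_2009", "A_2009_P", "B_2009_P")])
```

Columns come out in alphabetical order of Variable_Name, which is how xtabs() orders the factor levels.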
