How do I keep strings in first column with tidyr::gather? - r

This may be a very basic question about tidyr, which I just started learning, but I don't seem to find an answer after much searching in SO and Google.
Suppose I have a data frame:
mydf<- data.frame(name=c("Joe","Mary","Bob"),
jan=1:3,
feb=4:6,
mar=7:9,
apr=10:12)
which I want to reshape from wide to long. Before, I used melt, so:
library(reshape)
melt(mydf,id.vars = "name",measure.vars = colnames(mydf)[-1])
Which produces
name variable value
1 Joe jan 1
2 Mary jan 2
3 Bob jan 3
4 Joe feb 4
5 Mary feb 5
6 Bob feb 6
7 Joe mar 7
8 Mary mar 8
9 Bob mar 9
10 Joe apr 10
11 Mary apr 11
12 Bob apr 12
I wanted to use tidyr::gather, so I tried
gather(mydf,month,sales,jan:apr)
Which produces
name month sales
1 2 jan 1
2 3 jan 2
3 1 jan 3
4 2 feb 4
5 3 feb 5
6 1 feb 6
7 2 mar 7
8 3 mar 8
9 1 mar 9
10 2 apr 10
11 3 apr 11
12 1 apr 12
I'm lost here, as I haven't been able to keep the names in the first column.
What am I missing here?
######### EDIT TO ADD #######
> R.Version()$version.string
[1] "R version 3.2.2 (2015-08-14)"
> packageVersion("tidyr")
[1] ‘0.3.0’

It looks like in tidyr 0.3.0 you need to convert the factor column 'name' to character. The numbers you see in that column are the factor's underlying integer codes (Bob = 1, Joe = 2, Mary = 3). I'm not sure why that has changed from version 0.2.0, where it worked without conversion to character. Nevertheless, here we go ...
gather(transform(mydf, name = as.character(name)), month, sales, jan:apr)
# name month sales
# 1 Joe jan 1
# 2 Mary jan 2
# 3 Bob jan 3
# 4 Joe feb 4
# 5 Mary feb 5
# 6 Bob feb 6
# 7 Joe mar 7
# 8 Mary mar 8
# 9 Bob mar 9
# 10 Joe apr 10
# 11 Mary apr 11
# 12 Bob apr 12
R.version.string
# [1] "R version 3.2.2 (2015-08-14)"
packageVersion("tidyr")
# [1] ‘0.3.0’
Credit to @aosmith for finding the closed GitHub issue. You should be able to use the development version without issue now. To install the dev version, use
devtools::install_github(
"hadley/tidyr",
ref = "2e08772d154babcc97912bcae8b0b64b65b964ab"
)
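In later tidyr releases (1.0.0 and up), pivot_longer() supersedes gather(). The following is a minimal sketch, assuming such a version is installed. The base R part converts every factor column to character first, which is the same workaround as above applied to the whole data frame. Note that pivot_longer() orders its result by input row (Joe's four months first), not by month as gather() does.

```r
# Recreate the question's data with `name` stored as a factor
# (the data.frame() default before R 4.0.0)
mydf <- data.frame(name = c("Joe", "Mary", "Bob"),
                   jan = 1:3, feb = 4:6, mar = 7:9, apr = 10:12,
                   stringsAsFactors = TRUE)

# Convert every factor column to character (base R, no packages needed)
is_fct <- vapply(mydf, is.factor, logical(1))
mydf[is_fct] <- lapply(mydf[is_fct], as.character)

# Assumption: tidyr >= 1.0.0 is available; this step is skipped otherwise
if (requireNamespace("tidyr", quietly = TRUE)) {
  long <- tidyr::pivot_longer(mydf, cols = jan:apr,
                              names_to = "month", values_to = "sales")
  print(long)
}
```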

Related

Repeating annual values multiple times to form a monthly dataframe

I have an annual dataset as below:
year <- c(2016,2017,2018)
xxx <- c(1,2,3)
yyy <- c(4,5,6)
df <- data.frame(year,xxx,yyy)
print(df)
year xxx yyy
1 2016 1 4
2 2017 2 5
3 2018 3 6
Where the values in column xxx and yyy correspond to values for that year.
I would like to expand this dataframe (or create a new dataframe), which retains the same column names, but repeats each value 12 times (corresponding to the month of that year) and repeat the yearly value 12 times in the first column.
As mocked up by the code below:
year <- rep(2016:2018,each=12)
xxx <- rep(1:3,each=12)
yyy <- rep(4:6,each=12)
df2 <- data.frame(year,xxx,yyy)
print(df2)
year xxx yyy
1 2016 1 4
2 2016 1 4
3 2016 1 4
4 2016 1 4
5 2016 1 4
6 2016 1 4
7 2016 1 4
8 2016 1 4
9 2016 1 4
10 2016 1 4
11 2016 1 4
12 2016 1 4
13 2017 2 5
14 2017 2 5
15 2017 2 5
16 2017 2 5
17 2017 2 5
18 2017 2 5
19 2017 2 5
20 2017 2 5
21 2017 2 5
22 2017 2 5
23 2017 2 5
24 2017 2 5
25 2018 3 6
26 2018 3 6
27 2018 3 6
28 2018 3 6
29 2018 3 6
30 2018 3 6
31 2018 3 6
32 2018 3 6
33 2018 3 6
34 2018 3 6
35 2018 3 6
36 2018 3 6
Any help would be greatly appreciated!
I'm new to R and I can see how I would do this with a loop statement but was wondering if there was an easier solution.
Convert df to a matrix, take the Kronecker product with a vector of 12 ones and then convert back to a data.frame. The Kronecker product drops the column names, so restore them with setNames. The as.data.frame can be omitted if a matrix result is ok.
setNames(as.data.frame(as.matrix(df) %x% rep(1, 12)), names(df))
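Another option, shown here as a small base R sketch, is to repeat the row indices instead of going through a matrix. This keeps the column names and column types as they are. The variable name idx is only illustrative.

```r
# Annual data from the question
df <- data.frame(year = c(2016, 2017, 2018),
                 xxx = c(1, 2, 3),
                 yyy = c(4, 5, 6))

# Repeat each row index 12 times: 1,1,...,1,2,2,...,2,3,3,...,3
idx <- rep(seq_len(nrow(df)), each = 12)
df2 <- df[idx, ]

# Reset the row names, which otherwise become 1, 1.1, 1.2, ...
rownames(df2) <- NULL
```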

Extracting strings from links using regex in R

I have a list of url links and i want to extract one of the strings and save them in another variable. The sample data is below:
sample<- c("http://dps.endavadigital.net/owgr/doc/content/archive/2009/owgr01f2009.pdf",
"http://dps.endavadigital.net/owgr/doc/content/archive/2009/owgr02f2001.pdf",
"http://dps.endavadigital.net/owgr/doc/content/archive/2009/owgr03f2002.pdf",
"http://dps.endavadigital.net/owgr/doc/content/archive/2009/owgr04f2004.pdf",
"http://dps.endavadigital.net/owgr/doc/content/archive/2009/owgr05f2005.pdf",
"http://dps.endavadigital.net/owgr/doc/content/archive/2009/owgr06f2018.pdf",
"http://dps.endavadigital.net/owgr/doc/content/archive/2009/owgr07f2016.pdf",
"http://dps.endavadigital.net/owgr/doc/content/archive/2009/owgr08f2015.pdf",
"http://dps.endavadigital.net/owgr/doc/content/archive/2009/owgr09f2020.pdf",
"http://dps.endavadigital.net/owgr/doc/content/archive/2009/owgr10f2014.pdf")
sample
[1] "http://dps.endavadigital.net/owgr/doc/content/archive/2009/owgr01f2009.pdf"
[2] "http://dps.endavadigital.net/owgr/doc/content/archive/2009/owgr02f2001.pdf"
[3] "http://dps.endavadigital.net/owgr/doc/content/archive/2009/owgr03f2002.pdf"
[4] "http://dps.endavadigital.net/owgr/doc/content/archive/2009/owgr04f2004.pdf"
[5] "http://dps.endavadigital.net/owgr/doc/content/archive/2009/owgr05f2005.pdf"
[6] "http://dps.endavadigital.net/owgr/doc/content/archive/2009/owgr06f2018.pdf"
[7] "http://dps.endavadigital.net/owgr/doc/content/archive/2009/owgr07f2016.pdf"
[8] "http://dps.endavadigital.net/owgr/doc/content/archive/2009/owgr08f2015.pdf"
[9] "http://dps.endavadigital.net/owgr/doc/content/archive/2009/owgr09f2020.pdf"
[10] "http://dps.endavadigital.net/owgr/doc/content/archive/2009/owgr10f2014.pdf"
I want to extract week and year using regex.
week year
1 1 2009
2 2 2001
3 3 2002
4 4 2004
5 5 2005
6 6 2018
7 7 2016
8 8 2015
9 9 2020
10 10 2014
You could use str_match to capture numbers after 'owgr' and 'f' :
library(stringr)
str_match(sample, 'owgr(\\d+)f(\\d+)')[, -1]
You can convert this to a data frame, convert the columns to integer, and assign column names. The first capture group is the week and the second is the year.
setNames(type.convert(data.frame(
str_match(sample, 'owgr(\\d+)f(\\d+)')[, -1]), as.is = TRUE), c('week', 'year'))
# week year
#1 1 2009
#2 2 2001
#3 3 2002
#4 4 2004
#5 5 2005
#6 6 2018
#7 7 2016
#8 8 2015
#9 9 2020
#10 10 2014
Another way could be to extract all the numbers from last part of sample. We can get the last part with basename.
str_extract_all(basename(sample), '\\d+', simplify = TRUE)
Another way you can try
library(dplyr)
library(stringr)
df <- data.frame(sample)
df2 <- df %>%
transmute(week = as.integer(str_extract(sample, "(?<=wgr)\\d{1,2}(?=f)")), year = as.integer(str_extract(sample, "(?<=f)\\d{4}(?=\\.pdf)")))
# week year
# 1 1 2009
# 2 2 2001
# 3 3 2002
# 4 4 2004
# 5 5 2005
# 6 6 2018
# 7 7 2016
# 8 8 2015
# 9 9 2020
# 10 10 2014
You could use {unglue} :
library(unglue)
unglue_data(
sample,
"http://dps.endavadigital.net/owgr/doc/content/archive/2009/owgr{week}f{year}.pdf")
#> week year
#> 1 01 2009
#> 2 02 2001
#> 3 03 2002
#> 4 04 2004
#> 5 05 2005
#> 6 06 2018
#> 7 07 2016
#> 8 08 2015
#> 9 09 2020
#> 10 10 2014
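If you would rather avoid extra packages, the same extraction can be sketched in base R with sub() and backreferences. This assumes every file name follows the owgr<week>f<year>.pdf shape seen in the sample; the names urls, files and pattern are only illustrative, and just three of the sample links are used.

```r
# Three of the sample links from the question
urls <- c(
  "http://dps.endavadigital.net/owgr/doc/content/archive/2009/owgr01f2009.pdf",
  "http://dps.endavadigital.net/owgr/doc/content/archive/2009/owgr02f2001.pdf",
  "http://dps.endavadigital.net/owgr/doc/content/archive/2009/owgr10f2014.pdf"
)

# Keep only the file name, e.g. "owgr01f2009.pdf"
files <- basename(urls)

# Group 1 captures the week digits, group 2 the year digits
pattern <- "^owgr([0-9]+)f([0-9]+)\\.pdf$"

result <- data.frame(
  week = as.integer(sub(pattern, "\\1", files)),
  year = as.integer(sub(pattern, "\\2", files))
)
print(result)
```

Because sub() returns the input unchanged when the pattern does not match, a file name with a different shape would become NA (with a warning) after as.integer(), which makes unexpected links easy to spot.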

Replicating table in R with change in one column

I have this table in R :
Name ID Year Month Date
John 8 2017 7 16
Carol 90 2017 7 30
Bug 9 2017 7 1
I want to replicate this same table 4 times with all values unchanged except the Month column, which needs to be incremented by 1 in each copy. The final table should look like this:
Name ID Year Month Date
John 8 2017 7 16
Carol 90 2017 7 30
Bug 9 2017 7 1
John 8 2017 8 16
Carol 90 2017 8 30
Bug 9 2017 8 1
John 8 2017 9 16
Carol 90 2017 9 30
Bug 9 2017 9 1
John 8 2017 10 16
Carol 90 2017 10 30
Bug 9 2017 10 1
John 8 2017 11 16
Carol 90 2017 11 30
Bug 9 2017 11 1
Please point how to do this efficiently in R. Many thanks!
If this is your dataframe:
df = read.table(text = "Name ID Year Month Date
John 8 2017 7 16
Carol 90 2017 7 30
Bug 9 2017 7 1", header = TRUE)
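Then this is your dataframe repeated four times:
df2 = df[rep(rownames(df), 4),]
And this is it again, but with the months incremented (each = nrow(df) keeps the increment constant within one copy of the table):
df2$Month = df2$Month + rep(0:3, each = nrow(df))
In the more general case (use m = 5 to reproduce the 15-row table in the question):
m = 4 # <-- number of copies desired, including the original
df2 = df[rep(rownames(df), m), ]
df2$Month = df2$Month + rep(0:(m - 1), each = nrow(df))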
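An equivalent way to think about it, sketched below in base R, is to build one shifted copy of the table per month offset and stack the copies. The offsets 0:4 are chosen to reproduce the 15-row table shown in the question; the names offsets and copies are only illustrative.

```r
df <- read.table(text = "Name ID Year Month Date
John 8 2017 7 16
Carol 90 2017 7 30
Bug 9 2017 7 1", header = TRUE)

# One copy of the table per offset, with Month shifted by that offset
offsets <- 0:4
copies <- lapply(offsets, function(k) transform(df, Month = Month + k))

# Stack the copies in order: all month 7 rows, then month 8, and so on
df2 <- do.call(rbind, copies)
print(df2)
```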

Extracting unique records in R?

I tried "unique" and "duplicated" but cannot get R to do what I want, which is basically to compare two sets of data and find out who in the first data set is not in the second data set. data1 contains a customer ID, name and the year that person bought X. data2 contains a customer ID and year (2017) indicating they purchased X this year. What I want to do is extract a list of people from data1 who have NOT purchased X this year, so I can contact them and tell them to buy X again.
> data1
ID NAME YEAR
8 Ann 2016
10 Bill 2014
11 Doug 2016
12 Emma 2015
5 Fred 2014
9 Julie 2014
13 Karl 2016
15 Matt 2014
14 Rhett 2014
7 Sara 2015
4 Tom 2014
> data2
ID YEAR
29 2017
32 2017
10 2017
21 2017
11 2017
5 2017
28 2017
33 2017
24 2017
22 2017
31 2017
15 2017
25 2017
30 2017
26 2017
7 2017
23 2017
27 2017
Merging data1 and data2 by ID ( merge(data1, data2, by = "ID") ) gives me:
> merged_d1d2
ID NAME YEAR.x YEAR.y
1 5 Fred 2014 2017
2 7 Sara 2015 2017
3 10 Bill 2014 2017
4 11 Doug 2016 2017
5 15 Matt 2014 2017
...But I want everyone EXCEPT these people! I also added the names into data2 and then combined data1 and data2 using rbind which gives me a data set with duplicates (e.g. 2 Fred, 2 Sara, 2 Bill, etc.) I then tried to use "unique" and "duplicated" but these always leave one of those duplicates (1 Fred, 1 Sara) in the new data. I want everyone from data1 except those people. I have a feeling this is a simple process, but any help would be greatly appreciated.
Simply:
data1[!data1$ID%in%data2$ID,]
ID NAME YEAR
1 8 Ann 2016
4 12 Emma 2015
6 9 Julie 2014
7 13 Karl 2016
9 14 Rhett 2014
11 4 Tom 2014
Or you could try anti_join by ID from dplyr:
data1 <- read.table(text="ID NAME YEAR
8 Ann 2016
10 Bill 2014
11 Doug 2016
12 Emma 2015
5 Fred 2014
9 Julie 2014
13 Karl 2016
15 Matt 2014
14 Rhett 2014
7 Sara 2015
4 Tom 2014",header=TRUE, stringsAsFactors=FALSE)
data2 <- read.table(text="ID YEAR
29 2017
32 2017
10 2017
21 2017
11 2017
5 2017
28 2017
33 2017
24 2017
22 2017
31 2017
15 2017
25 2017
30 2017
26 2017
7 2017
23 2017
27 2017",header=TRUE, stringsAsFactors=FALSE)
library(dplyr)
anti_join(data1,data2,by="ID")
ID NAME YEAR
1 4 Tom 2014
2 8 Ann 2016
3 9 Julie 2014
4 12 Emma 2015
5 13 Karl 2016
6 14 Rhett 2014
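The merge attempt from the question can also be turned into the answer. The sketch below uses a left join (all.x = TRUE), which keeps every row of data1; customers without a 2017 purchase then have NA in YEAR.y and can be filtered out. The data frames are rebuilt from the question's tables, and the names merged and not_yet are only illustrative.

```r
data1 <- data.frame(
  ID   = c(8, 10, 11, 12, 5, 9, 13, 15, 14, 7, 4),
  NAME = c("Ann", "Bill", "Doug", "Emma", "Fred", "Julie",
           "Karl", "Matt", "Rhett", "Sara", "Tom"),
  YEAR = c(2016, 2014, 2016, 2015, 2014, 2014, 2016, 2014, 2014, 2015, 2014),
  stringsAsFactors = FALSE
)
data2 <- data.frame(
  ID   = c(29, 32, 10, 21, 11, 5, 28, 33, 24, 22, 31, 15, 25, 30, 26, 7, 23, 27),
  YEAR = 2017
)

# Left join: keep all of data1; YEAR.y is NA where the ID is absent from data2
merged <- merge(data1, data2, by = "ID", all.x = TRUE)

# Customers with no 2017 purchase (merge() sorts the result by ID)
not_yet <- merged[is.na(merged$YEAR.y), c("ID", "NAME", "YEAR.x")]
print(not_yet)
```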

Manipulating csv spreadsheet using R

I want to write a basic loop that looks like this:
Import spreadsheet as data frame
Scanning by variable in the header, find any missing data point ("NA") and remove all data for that calendar month for that variable, i.e.:
Here var 'X' has 'NA' at the second january. I want to remove all january values of 'X'
X Y Z
jan 3 3 3
jan NA 4 5
jan 2 6 2
feb 1 8 NA
feb 4 2 3
feb 9 4 1
mar 5 NA 5
mar 8 7 4
mar 9 7 5
Creating new dataframes that looks like:
X
feb 1
feb 4
feb 9
mar 5
mar 8
mar 9
Y
jan 3
jan 4
jan 6
feb 8
feb 2
feb 4
Z
jan 3
jan 5
jan 2
mar 5
mar 4
mar 5
Save remaining 'complete months' (in this case 'X'feb-mar, 'Y' jan-feb, 'Z' jan&mar) in new data frame to export as new .csv file
Any help getting started would be huge. If this has already been asked, please direct me to the source; I wasn't sure exactly how to search for this.
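Try (assuming the data are in a data frame ddf whose first column is month, followed by X, Y and Z; using %in% also handles NAs in more than one month):
ddf2 = ddf[,c(1,2)]
xdf = ddf[!ddf$month %in% ddf2$month[is.na(ddf2$X)], c(1,2)]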
xdf
month X
4 feb 1
5 feb 4
6 feb 9
7 mar 5
8 mar 8
9 mar 9
ddf2 = ddf[,c(1,3)]
ydf = ddf[!ddf$month %in% ddf2$month[is.na(ddf2[,2])], c(1,3)]
ydf
month Y
1 jan 3
2 jan 4
3 jan 6
4 feb 8
5 feb 2
6 feb 4
ddf2 = ddf[,c(1,4)]
zdf = ddf[!ddf$month %in% ddf2$month[is.na(ddf2[,2])], c(1,4)]
zdf
month Z
1 jan 3
2 jan 5
3 jan 2
7 mar 5
8 mar 4
9 mar 5
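The three near-identical steps can be folded into one loop. The following is a sketch only: it rebuilds the example table as a data frame called ddf with a month column (an assumption, since the question does not show how the spreadsheet is imported), and the output file names are hypothetical.

```r
# Example data from the question, with the month labels as a column
ddf <- data.frame(
  month = rep(c("jan", "feb", "mar"), each = 3),
  X = c(3, NA, 2, 1, 4, 9, 5, 8, 9),
  Y = c(3, 4, 6, 8, 2, 4, NA, 7, 7),
  Z = c(3, 5, 2, NA, 3, 1, 5, 4, 5),
  stringsAsFactors = FALSE
)

vars <- c("X", "Y", "Z")

# For each variable, drop every month that contains at least one NA
complete_months <- lapply(vars, function(v) {
  bad_months <- unique(ddf$month[is.na(ddf[[v]])])
  ddf[!ddf$month %in% bad_months, c("month", v)]
})
names(complete_months) <- vars

# Write one CSV per variable (hypothetical file names, written to a temp dir)
for (v in vars) {
  out_file <- file.path(tempdir(), paste0(v, "_complete_months.csv"))
  write.csv(complete_months[[v]], out_file, row.names = FALSE)
}
```

With real data, ddf would come from something like read.csv() on the spreadsheet export, and vars could be taken from the column names instead of being typed out.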
