Manipulating data in R from columns to rows - r

I have data that is currently organized as follows:
X.1 State MN X.2 WI X.3
NA Price Pounds Price Pounds
Year NA
1980 NA 56 23 56 96
1999 NA 41 63 56 65
I would like to convert it to something more like this:
Year State Price Pounds
1980 MN 56 23
1999 MN 41 63
1980 WI 56 96
1999 WI 56 65
Any suggestions for some R-code to manipulate this data correctly?
Thanks!

This requires some manipulation to get it into a format that you can reshape.
df <- read.table(h=T, t=" X.1 State MN X.2 WI X.3
NA NA Price Pounds Price Pounds
Year NA NA NA NA NA
1980 NA 56 23 56 96
1999 NA 41 63 56 65")
df <- df[-2]
# Auto-process names; you should look at intermediate step results to see
# what's going on. This would probably be better addressed with something
# like `na.locf` from `zoo` but this is all in base. Note you can do something
# a fair bit simpler if you know you have the same number of items for each
# state, but this should be robust to different numbers.
df.names <- names(df)
df.names <- ifelse(grepl("^X\\.[0-9]+$", df.names), NA, df.names) # escape the dot so only auto-generated names like X.1 match
df.names[[1]] <- "Year"
df.names.valid <- Filter(Negate(is.na), df.names)
df.names[is.na(df.names)] <- df.names.valid[cumsum(!is.na(df.names))[is.na(df.names)]]
names(df) <- df.names
# rename again by adding Price/Pounds
names(df)[-1] <- paste(
vapply(2:5, function(x) as.character(df[1, x]), ""), # need to do this because we're pulling across different factor columns
names(df)[-1],
sep="."
)
df <- df[-(1:2),] # Don't need rows 1:2 anymore
df
Produces:
Year Price.MN Pounds.MN Price.WI Pounds.WI
3 1980 56 23 56 96
4 1999 41 63 56 65
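The fill-forward step used for the column names above can be looked at in isolation. This is a minimal sketch on a hand-written vector (the vector `x` is an illustration, not part of the original data), and it assumes the first element is not NA:

```r
# Placeholder names have already been turned into NA
x <- c("Year", "MN", NA, "WI", NA)

# Non-missing values, in order of appearance
valid <- x[!is.na(x)]

# cumsum(!is.na(x)) counts how many valid values have been seen so far,
# so it indexes the most recent valid value for every position
filled <- valid[cumsum(!is.na(x))]
filled
# [1] "Year" "MN"   "MN"   "WI"   "WI"
```

If the vector could start with NA, the index would contain a 0 and the result would come out shorter than the input, so that case needs separate handling.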
Then:
using base reshape:
reshape(df, direction="long", varying=2:5)
Which gets you basically where you want to be:
Year time Price Pounds id
1.MN 1980 MN 56 23 1
2.MN 1999 MN 41 63 2
1.WI 1980 WI 56 96 1
2.WI 1999 WI 56 65 2
Clearly you'll want to rename some columns, etc., but that's straightforward. The key point with reshape is that the column names matter so we constructed them in a way that reshape can use.
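As a rough sketch of that clean-up, the time column can be named directly and Year can serve as the id, assuming each year appears only once in the wide data. The data frame below is typed in by hand to mirror the intermediate result shown earlier:

```r
# Wide data with names of the form <measure>.<state>
df <- data.frame(Year = c(1980, 1999),
                 Price.MN = c(56, 41), Pounds.MN = c(23, 63),
                 Price.WI = c(56, 56), Pounds.WI = c(96, 65))

# reshape() splits each varying name at "." into measure and state;
# timevar names the state column, idvar reuses Year instead of adding an id
long <- reshape(df, direction = "long", varying = 2:5,
                timevar = "State", idvar = "Year")

# Drop the generated row names such as "1980.MN"
rownames(long) <- NULL
long
```

If a year can occur more than once, Year is no longer a valid id and the default generated id column should be kept instead.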
using reshape2::melt/cast:
library(reshape2)
df.mlt <- melt(df, id.vars="Year")
df.mlt <- transform(df.mlt,
metric=sub("\\..*", "", variable),
state=sub(".*\\.", "", variable)
)
dcast(df.mlt[-2], Year + state ~ metric)
produces:
Year state Pounds Price
1 1980 MN 23 56
2 1980 WI 96 56
3 1999 MN 63 41
4 1999 WI 65 56
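Be very careful: Price and Pounds are likely to be factors, because each column originally held both character and numeric values (in R 4.0.0 and later, where stringsAsFactors defaults to FALSE, they will be character instead). Convert them to numeric with as.numeric(as.character(df$Price)), and likewise for Pounds.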
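A small illustration of why the as.character() step matters, using a made-up factor rather than the actual columns:

```r
# A factor whose labels look like numbers
price <- factor(c("56", "41", "96"))

# Direct conversion returns the internal level codes, not the values
codes <- as.numeric(price)
codes
# [1] 2 1 3

# Going through character first recovers the numbers
values <- as.numeric(as.character(price))
values
# [1] 56 41 96
```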

Well that was a nice challenge. It's a lot of strsplits and greps, and it may not generalize to your entire data set. Or maybe it will, you never know.
> txt <- "X.1 State MN X.2 WI X.3
NA Price Pounds Price Pounds
Year NA
1980 NA 56 23 56 96
1999 NA 41 63 56 65"
>
> x <- textConnection(txt)
> y <- gsub("((X[.][0-9]{1})|NA)|\\s+", " ", readLines(x))
> z <- unlist(strsplit(y, "^\\s+"))
> a <- z[nzchar(z)]
> b <- unlist(strsplit(a, "\\s+"))
> nums <- as.numeric(grep("[0-9]", b[nchar(b) == 2], value = TRUE))
> Price = rev(nums[c(TRUE, FALSE)])
> pounds <- nums[-which(nums %in% Price)]
> data.frame(Year = rep(b[grepl("[0-9]{4}", b)], 2),
State = unlist(lapply(b[grepl("[A-Z]{2}", b)], rep, 2)),
Price = Price,
Pounds = c(pounds[1], rev(pounds[2:3]), pounds[4]))
## Year State Price Pounds
## 1 1980 MN 56 23
## 2 1999 MN 41 63
## 3 1980 WI 56 96
## 4 1999 WI 56 65


How to use a loop to create panel data by subsetting and merging a lot of different data frames in R?

I've looked around but I can't find an answer to this!
I've imported a large number of datasets to R.
Each dataset contains information for a single year (ex. df_2012, df_2013, df_2014 etc).
All the datasets have the same variables/columns (ex. varA_2012 in df_2012 corresponds to varA_2013 in df_2013).
I want to create a df with my id variable and varA_2012, varB_2012, varA_2013, varB_2013, varA_2014, varB_2014 etc
I'm trying to create a loop that helps me extract the few columns that I'm interested in (varA_XXXX, varB_XXXX) in each data frame and then do a full join based on my id var.
I haven't used R in a very long time...
So far, I've tried this:
id <- c("France", "Belgium", "Spain")
varA_2012 <- c(1,2,3)
varB_2012 <- c(7,2,9)
varC_2012 <- c(1,56,0)
varD_2012 <- c(13,55,8)
varA_2013 <- c(34,3,56)
varB_2013 <- c(2,53,5)
varC_2013 <- c(24,3,45)
varD_2013 <- c(27,13,8)
varA_2014 <- c(9,10,5)
varB_2014 <- c(95,30,75)
varC_2014 <- c(99,0,51)
varD_2014 <- c(9,40,1)
df_2012 <-data.frame(id, varA_2012, varB_2012, varC_2012, varD_2012)
df_2013 <-data.frame(id, varA_2013, varB_2013, varC_2013, varD_2013)
df_2014 <-data.frame(id, varA_2014, varB_2014, varC_2014, varD_2014)
year = c(2012:2014)
for(i in 1:length(year)) {
df_[i] <- df_[I][df_[i]$id, df_[i]$varA_[i], df_[i]$varB_[i], ]
list2env(df_[i], .GlobalEnv)
}
panel_df <- Reduce(function(x, y) merge(x, y, by="if"), list(df_2012, df_2013, df_2014))
I know that there are probably loads of errors in here.
Here are a couple of options; however, it's unclear what you want the expected output to look like.
If you want a wide format, then we can use tidyverse to do:
library(tidyverse)
results <-
  map(list(df_2012, df_2013, df_2014), function(x)
    x %>% dplyr::select(id, starts_with("varA"), starts_with("varB"))) %>%
  reduce(., function(x, y)
    # full_join() keeps ids that are missing in some years;
    # left_join() has no `all` argument
    full_join(x, y, by = "id"))
Output
id varA_2012 varB_2012 varA_2013 varB_2013 varA_2014 varB_2014
1 Belgium 2 2 3 53 10 30
2 France 1 7 34 2 9 95
3 Spain 3 9 56 5 5 75
However, if you need it in a long format, then we could pivot the data:
results %>%
pivot_longer(-id, names_to = c("variable", "year"), names_sep = "_")
Output
id variable year value
<chr> <chr> <chr> <dbl>
1 France varA 2012 1
2 France varB 2012 7
3 France varA 2013 34
4 France varB 2013 2
5 France varA 2014 9
6 France varB 2014 95
7 Belgium varA 2012 2
8 Belgium varB 2012 2
9 Belgium varA 2013 3
10 Belgium varB 2013 53
11 Belgium varA 2014 10
12 Belgium varB 2014 30
13 Spain varA 2012 3
14 Spain varB 2012 9
15 Spain varA 2013 56
16 Spain varB 2013 5
17 Spain varA 2014 5
18 Spain varB 2014 75
Or if using base R for the wide format, then we can do:
results <-
lapply(list(df_2012, df_2013, df_2014), function(x)
subset(x, select = c("id", names(x)[startsWith(names(x), "varA")], names(x)[startsWith(names(x), "varB")])))
results <-
Reduce(function(x, y)
merge(x, y, all = TRUE, by = "id"), results)
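With many years, typing out the list of data frames becomes tedious. The following base R sketch builds the list from the object names instead. It assumes the data frames are named df_<year> and live in the current environment; the two small frames here are cut-down stand-ins for the ones in the question.

```r
# Stand-in data frames, one per year
df_2012 <- data.frame(id = c("France", "Belgium", "Spain"),
                      varA_2012 = c(1, 2, 3), varB_2012 = c(7, 2, 9),
                      varC_2012 = c(1, 56, 0))
df_2013 <- data.frame(id = c("France", "Belgium", "Spain"),
                      varA_2013 = c(34, 3, 56), varB_2013 = c(2, 53, 5),
                      varC_2013 = c(24, 3, 45))

years <- 2012:2013

# Look the data frames up by name (assumed naming scheme: df_<year>)
frames <- mget(paste0("df_", years), envir = environment())

# Keep id plus the varA_/varB_ columns of each frame
keep <- lapply(frames, function(x)
  x[, grepl("^(id|var[AB]_)", names(x)), drop = FALSE])

# Full join on id across all years
panel_df <- Reduce(function(x, y) merge(x, y, by = "id", all = TRUE), keep)
panel_df
```

Extending `years` is then the only change needed when more data frames are imported.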
From your initial for loop attempt, it seems the code below may help
> (df <- Reduce(merge, list(df_2012, df_2013, df_2014)))[grepl("^(id|var(A|B))",names(df))]
id varA_2012 varB_2012 varA_2013 varB_2013 varA_2014 varB_2014
1 Belgium 2 2 3 53 10 30
2 France 1 7 34 2 9 95
3 Spain 3 9 56 5 5 75

Calculating the difference between two two-digit years

Is there any easy way in R to calculate the difference between two columns of two-digit years (just years, no months/days as it's unnecessary here) in order to produce a column of ages?
I'm fairly new to this and have been playing with 'if' statements and algebra without success.
The data looks like this, but larger:
dat <- data.frame(year1=c("98","99","00","01","02"),
year2=c("03","04","05","06","07"))
You could use strptime() with the format %y:
dat <- data.frame(year1=c("98","99","00","01","02"),
year2=c("03","04","05","06","07"),
stringsAsFactors = F) # You might want to use this as a default!
dat$year1 <- strptime(dat$year1, format = "%y")
dat$year2 <- strptime(dat$year2, format = "%y")
as.vector(difftime(dat$year2,
dat$year1,
units = "days"))/365.242
4.999311 5.002163 4.999425 4.999425 4.999425
Format to a date, format back to a number, take the difference (the columns are passed in reverse order so that the result is year2 minus year1):
do.call(`-`, lapply(dat[2:1], function(x)
as.numeric(format(as.Date(x, format="%y"), "%Y"))))
#[1] 5 5 5 5 5
This may hit cases where it doesn't work if you have old dates in the early 1900's. As per ?strptime:
‘%y’ Year without century (00-99). On input, values 00 to 68 are
prefixed by 20 and 69 to 99 by 19 - that is the behaviour
specified by the 2004 and 2008 POSIX standards, but they do
also say ‘it is expected that in a future version the default
century inferred from a 2-digit year will change’.
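The cut-off described in that help text can be checked directly. This is a small sketch; a fixed month and day are pasted on so the result does not depend on the current date:

```r
two_digit <- c("00", "68", "69", "99")

# Parse as 1 January of each two-digit year, then read back the full year
full_year <- as.numeric(format(
  as.Date(paste0(two_digit, "-01-01"), format = "%y-%m-%d"), "%Y"))
full_year
# [1] 2000 2068 1969 1999
```

So a birth year such as "30" would be read as 2030, not 1930, which is the failure case mentioned above.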
df$age <- ifelse(df$year2 < df$year1, df$year2 - df$year1 + 100, df$year2 -df$year1)
should work under the assumption year2 is some kind of current year and year1 is the year of birth and there are no people born before 1918.
Example:
df <- data.frame(year1 = sample(18:99, 1000, replace = T),
year2 = sample(1:99, 1000, replace = T))
> head(df)
year1 year2
1 27 88
2 41 55
3 90 36
4 81 93
5 56 60
6 27 61
df$age <- ifelse(df$year2 < df$year1, df$year2 - df$year1 + 100, df$year2 -df$year1)
> head(df)
year1 year2 age
1 73 88 15
2 50 17 67
3 47 41 94
4 54 43 89
5 36 82 46
6 62 85 23
With your data example:
dat <- data.frame(year1=c("98","99","00","01","02"),
year2=c("03","04","05","06","07"))
dat$age <- ifelse(as.numeric(as.character(dat$year2)) < as.numeric(as.character(dat$year1)),
as.numeric(as.character(dat$year2)) - as.numeric(as.character(dat$year1)) + 100,
as.numeric(as.character(dat$year2)) - as.numeric(as.character(dat$year1)))
> dat
year1 year2 age
1 98 03 5
2 99 04 5
3 00 05 5
4 01 06 5
5 02 07 5
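Under the same assumption (year2 is later than year1 and the gap is under 100 years), the ifelse() can be written more compactly with modular arithmetic. A minimal sketch on the example data:

```r
dat <- data.frame(year1 = c("98", "99", "00", "01", "02"),
                  year2 = c("03", "04", "05", "06", "07"),
                  stringsAsFactors = FALSE)

y1 <- as.integer(dat$year1)
y2 <- as.integer(dat$year2)

# %% returns a non-negative remainder in R, so a negative difference
# such as 3 - 98 = -95 wraps around to 5
dat$age <- (y2 - y1) %% 100L
dat$age
# [1] 5 5 5 5 5
```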
One method is to use as.Date in a dplyr chain:
library(dplyr)
dat %>%
mutate(year1 = as.Date(year1, format = "%y"),
year2 = as.Date(year2, format = "%y")) %>%
mutate(age = year2 - year1)
which returns:
year1 year2 age
1 1998-10-26 2003-10-26 1826 days
2 1999-10-26 2004-10-26 1827 days
3 2000-10-26 2005-10-26 1826 days
4 2001-10-26 2006-10-26 1826 days
5 2002-10-26 2007-10-26 1826 days
P.S. as.Date fills in a default day and month, but it uses the same values for both columns, so they do not affect the difference.

R: how to avoid a "for" loop when going through a data frame

Here is a brief example.
I have a data frame data1.
name<-c("John","John","Mike","Amy".....)
nationality<-c("Canada","America","Spain","Japan".....)
data1<-data.frame(name,nationality....)
That is, the people come from different countries.
Each person is uniquely identified by name and country, with no repeats.
The second data frame is
name2<-c("John","John","Mike","John",......)
nationality2<-c("Canada","Canada","Canada".....)
score<-c(87,67,98,78,56......)
data2<-data.frame(name2,nationality2,score)
Every person is guaranteed to have 5 rows in data2, that is, 5 scores, but the rows are in random order.
I want to collect every person's 5 scores; I don't care about the name or the country.
The final data frame I want is
score1 score2 score3 score4 score5
1 89 89 87 78 90
2 ...
3 ...
Every row represents one person's 5 scores; I don't care who the person is.
My data set is too large to loop over with for.
What can I do?
Although there is already an accepted answer that uses base R, I would like to suggest a solution that uses the convenient dcast() function to reshape from long to wide form, instead of tapply() and repeated calls to rbind():
library(data.table) # CRAN version 1.10.4 used
dcast(setDT(data2)[setDT(data1), on = c(name2 = "name", nationality2 = "nationality")],
name2 + nationality2 ~ paste0("score", rowid(rleid(name2, nationality2))),
value.var = "score")
returns
name2 nationality2 score1 score2 score3 score4 score5
1: Amy Canada 93 91 73 8 79
2: John America 3 77 69 89 31
3: Mike Canada 76 92 46 47 75
It seems to me that's what you're asking:
data1 <- data.frame(name = c("John","Mike","Amy"),
nationality = c("America","Canada","Canada"))
data2 <- data.frame(name2 = rep(c("John","Mike","Amy","Jack","John"),each = 5),
score = sample(100,25), nationality2 =rep(c("America","Canada","Canada","Canada","Canada"),each = 5))
data3 <- merge(data2,data1,by.x=c("name2","nationality2"),by.y=c("name","nationality"))
data3$name_country <- paste(data3$name2,data3$nationality2)
all_scores_list <- tapply(data3$score,data3$name_country,c)
as.data.frame(do.call(rbind,all_scores_list))
# V1 V2 V3 V4 V5
# Amy Canada 57 69 90 81 50
# John America 4 92 75 15 2
# Mike Canada 25 86 51 20 12
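If named score columns are wanted in base R, a within-person counter can be built with ave() and used as the column dimension. This is only a sketch on a tiny made-up data set with three scores per person; the names `key`, `idx` and `wide` are illustrative.

```r
data2 <- data.frame(name2 = c("John", "Mike", "John", "Mike", "John", "Mike"),
                    nationality2 = c("Canada", "Spain", "Canada",
                                     "Spain", "Canada", "Spain"),
                    score = c(87, 98, 67, 56, 78, 90))

# One key per person (name plus country)
key <- paste(data2$name2, data2$nationality2)

# Running counter 1, 2, 3, ... within each person, in row order
idx <- ave(seq_along(key), key, FUN = seq_along)

# Each (person, counter) cell holds exactly one score
wide <- tapply(data2$score, list(key, paste0("score", idx)), function(v) v)
wide
#             score1 score2 score3
# John Canada     87     67     78
# Mike Spain      98     56     90
```

The result is a matrix; as.data.frame(wide) turns it into a data frame, and no explicit loop over people is involved.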

Reducing rows and expanding columns of data.frame in R

I have this data.frame in R.
> a <- data.frame(year = c(2001,2001,2001,2001), country = c("Japan", "Japan","US","US"), type = c("a","b","a","b"), amount = c(35,67,39,45))
> a
year country type amount
1 2001 Japan a 35
2 2001 Japan b 67
3 2001 US a 39
4 2001 US b 45
How should I transform this into a data.frame that looks like this?
year country type.a type.b
1 2001 Japan 35 67
2 2001 US 39 45
Basically I want the number of rows to be the number of (year x country) pairs, and I want to create additional columns for each type.
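Base solution; the result still needs its columns and row names renamed. Using both year and country as idvar keeps one row per (year x country) pair:
reshape(a, v.names="amount", timevar="type", idvar=c("year", "country"), direction="wide")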
year country amount.a amount.b
1 2001 Japan 35 67
3 2001 US 39 45
reshape2 solution
library(reshape2)
dcast(a, year+country ~ paste("type", type, sep="."), value.var="amount")
year country type.a type.b
1 2001 Japan 35 67
2 2001 US 39 45
Another way would be to use spread in the tidyr package and rename in the dplyr package to deliver the expected outcome.
library(dplyr)
library(tidyr)
spread(a,type, amount) %>%
rename(type.a = a, type.b = b)
# year country type.a type.b
#1 2001 Japan 35 67
#2 2001 US 39 45
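In current versions of tidyr, spread() is superseded by pivot_wider(), which can add the "type." prefix itself, so the separate rename step is not needed. A minimal sketch, assuming tidyr 1.0.0 or later is installed:

```r
library(tidyr)

a <- data.frame(year = c(2001, 2001, 2001, 2001),
                country = c("Japan", "Japan", "US", "US"),
                type = c("a", "b", "a", "b"),
                amount = c(35, 67, 39, 45))

# One column per type, filled from amount, with names prefixed by "type."
wide <- pivot_wider(a, names_from = type, values_from = amount,
                    names_prefix = "type.")
wide
```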

Reshaping a data frame --- changing rows to columns

Suppose that we have a data frame that looks like
set.seed(7302012)
county <- rep(letters[1:4], each=2)
state <- rep(LETTERS[1], times=8)
industry <- rep(c("construction", "manufacturing"), 4)
employment <- round(rnorm(8, 100, 50), 0)
establishments <- round(rnorm(8, 20, 5), 0)
data <- data.frame(state, county, industry, employment, establishments)
state county industry employment establishments
1 A a construction 146 19
2 A a manufacturing 110 20
3 A b construction 121 10
4 A b manufacturing 90 27
5 A c construction 197 18
6 A c manufacturing 73 29
7 A d construction 98 30
8 A d manufacturing 102 19
We'd like to reshape this so that each row represents a (state and) county, rather than a county-industry, with columns construction.employment, construction.establishments, and analogous versions for manufacturing. What is an efficient way to do this?
One way is to subset
construction <- data[data$industry == "construction", ]
names(construction)[4:5] <- c("construction.employment", "construction.establishments")
And similarly for manufacturing, then do a merge. This isn't so bad if there are only two industries, but imagine that there are 14; this process would become tedious (though made less so by using a for loop over the levels of industry).
Any other ideas?
This can be done in base R reshape, if I understand your question correctly:
reshape(data, direction="wide", idvar=c("state", "county"), timevar="industry")
# state county employment.construction establishments.construction
# 1 A a 146 19
# 3 A b 121 10
# 5 A c 197 18
# 7 A d 98 30
# employment.manufacturing establishments.manufacturing
# 1 110 20
# 3 90 27
# 5 73 29
# 7 102 19
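reshape() names the new columns measure.industry, while the question asks for industry.measure. One possible fix is to swap the two parts of each name afterwards. The sketch below uses the first two counties from the table above, typed in by hand:

```r
data <- data.frame(state = "A",
                   county = rep(c("a", "b"), each = 2),
                   industry = rep(c("construction", "manufacturing"), 2),
                   employment = c(146, 110, 121, 90),
                   establishments = c(19, 20, 10, 27))

wide <- reshape(data, direction = "wide",
                idvar = c("state", "county"), timevar = "industry")

# Swap "measure.industry" to "industry.measure"; names without a dot
# (state, county) do not match the pattern and are left alone
names(wide) <- sub("^(.*)\\.(.*)$", "\\2.\\1", names(wide))
names(wide)
```

This relies on neither the measure names nor the industry names containing a dot themselves.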
Also using the reshape package:
library(reshape)
m <- reshape::melt(data)
cast(m, state + county~...)
Yielding:
> cast(m, state + county~...)
state county construction_employment construction_establishments manufacturing_employment manufacturing_establishments
1 A a 146 19 110 20
2 A b 121 10 90 27
3 A c 197 18 73 29
4 A d 98 30 102 19
I usually use base reshape myself, and I had forgotten that there is a newer reshape2 package (also by Wickham), which I probably should have used here. With reshape2 the code is slightly different:
library(reshape2)
m <- reshape2::melt(data)
dcast(m, state + county~...)
