Long to wide with automatic dummy creation and multiple value columns - r

I'm sitting in front of a dataframe that looks like this:
country year Indicator a b c
48996 US 2003 var1 NA NA NA
16953 FR 1988 var2 NA 10664.920 NA
22973 FR 1943 var3 NA 5774.334 NA
8760 CN 1995 var4 8804.565 NA 12750.31
47795 US 2012 var5 NA NA NA
30033 GB 1969 var6 NA 29631.362 NA
25796 FR 1921 var7 NA 14004.520 NA
39534 NL 1941 var8 NA NA NA
42255 NZ 1969 var8 NA NA NA
7249 CN 1995 var9 50635.862 NA 75260.56
What I want to do is basically a long to wide transformation with Indicator as key variable. I would usually use spread() from the tidyr package. However, spread() unfortunately does not accept multiple value columns (in this case a, b and c) and it does not fully do what I want to achieve:
Make the entries of Indicator the new columns
Keep the Country / Year combinations as rows
Create a UNIQUE row for every old value from a, b and c
Create a dummy variable for every "old" value column name (i.e. a, b, c)
So in the end, the Chinese observations of my example should become
country year var1 [...] var4 [...] var9 dummy.a dummy.b dummy.c
CN 1995 NA 8804.565 50635.862 1 0 0
CN 1995 NA 12750.31 75260.56 0 0 1
As my original dataframe is 58.162x119, I would appreciate something that does not include a lot of manual work :-)
I hope I was clear in what I want to achieve. Thanks for your help!
The above mentioned dataframe can be reproduced using the following code:
structure(list(country = c("US", "FR", "FR", "CN", "US", "GB",
"FR", "NL", "NZ", "CN"), year = c(2003L, 1988L, 1943L, 1995L,
2012L, 1969L, 1921L, 1941L, 1969L, 1995L), Indicator = structure(c(1L,
2L, 3L, 4L, 5L, 6L, 7L, 8L, 8L, 9L), .Label = c("var1", "var2",
"var3", "var4", "var5", "var6", "var7", "var8", "var9", "var10",
"var11", "var12", "var13", "var14", "var15", "var16", "var17",
"var18"), class = "factor"), a = c(NA, NA, NA, 8804.56480733,
NA, NA, NA, NA, NA, 50635.8621327), b = c(NA, 10664.9199219,
5774.33398438, NA, NA, 29631.3618614, 14004.5195312, NA, NA,
NA), c = c(NA, NA, NA, 12750.3056855, NA, NA, NA, NA, NA, 75260.555946
)), .Names = c("country", "year", "Indicator", "a", "b", "c"), row.names = c(48996L,
16953L, 22973L, 8760L, 47795L, 30033L, 25796L, 39534L, 42255L,
7249L), class = "data.frame")

Here's my solution:
library(tidyr)
library(dplyr)  # provides %>%, filter() and mutate() used below
mydf <- structure(list(country = c("US", "FR", "FR", "CN", "US", "GB",
"FR", "NL", "NZ", "CN"), year = c(2003L, 1988L, 1943L, 1995L,
2012L, 1969L, 1921L, 1941L, 1969L, 1995L), Indicator = structure(c(1L,
2L, 3L, 4L, 5L, 6L, 7L, 8L, 8L, 9L), .Label = c("var1", "var2",
"var3", "var4", "var5", "var6", "var7", "var8", "var9", "var10",
"var11", "var12", "var13", "var14", "var15", "var16", "var17",
"var18"), class = "factor"), a = c(NA, NA, NA, 8804.56480733,
NA, NA, NA, NA, NA, 50635.8621327), b = c(NA, 10664.9199219,
5774.33398438, NA, NA, 29631.3618614, 14004.5195312, NA, NA,
NA), c = c(NA, NA, NA, 12750.3056855, NA, NA, NA, NA, NA, 75260.555946
)), .Names = c("country", "year", "Indicator", "a", "b", "c"), row.names = c(48996L,
16953L, 22973L, 8760L, 47795L, 30033L, 25796L, 39534L, 42255L,
7249L), class = "data.frame")
mydf %>%
  gather(key = newIndicator, value = values, a, b, c) %>%
  filter(!is.na(values)) %>%
  spread(key = Indicator, value = values) %>%
  mutate(indicatorValues = 1) %>%
  spread(newIndicator, indicatorValues, fill = 0)
The output
# country year var2 var3 var4 var6 var7 var9 a b c
# 1 CN 1995 NA NA 8804.565 NA NA 50635.86 1 0 0
# 2 CN 1995 NA NA 12750.306 NA NA 75260.56 0 0 1
# 3 FR 1921 NA NA NA NA 14004.52 NA 0 1 0
# 4 FR 1943 NA 5774.334 NA NA NA NA 0 1 0
# 5 FR 1988 10664.92 NA NA NA NA NA 0 1 0
# 6 GB 1969 NA NA NA 29631.36 NA NA 0 1 0

Here dt is your original data frame and dt2 is the final output (dplyr and tidyr need to be loaded):
dt2 <- dt %>%
gather(Parameter, Value, a:c) %>%
spread(Indicator, Value) %>%
mutate(Data = ifelse(rowSums(is.na(.[, paste0("var", 1:9)])) != 9, 1, 0)) %>%
filter(Data != 0) %>%
spread(Parameter, Data, fill = 0) %>%
rename(dummy.a = a, dummy.b = b, dummy.c = c)
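Both answers above use gather() and spread(), which tidyr has since superseded with pivot_longer() and pivot_wider(). The following is only a sketch of the same idea with the newer functions, run on a three-row toy subset of the question's data; the helper column names source and flag are made up for illustration.

```r
library(dplyr)
library(tidyr)

# Toy subset of the question's data (values rounded as printed there).
mydf <- data.frame(
  country   = c("CN", "CN", "FR"),
  year      = c(1995L, 1995L, 1988L),
  Indicator = c("var4", "var9", "var2"),
  a = c(8804.565, 50635.862, NA),
  b = c(NA, NA, 10664.920),
  c = c(12750.31, 75260.56, NA)
)

wide <- mydf %>%
  # Stack a, b, c into one value column and drop the NA entries.
  pivot_longer(c(a, b, c), names_to = "source", values_to = "value",
               values_drop_na = TRUE) %>%
  # One column per Indicator; rows are country / year / source.
  pivot_wider(names_from = Indicator, values_from = value) %>%
  # Turn the source column into 0/1 dummy columns.
  mutate(flag = 1) %>%
  pivot_wider(names_from = source, values_from = flag,
              values_fill = 0, names_prefix = "dummy.")

print(wide)
```

On this subset the result has one row per country / year / source combination, so the two Chinese rows stay separate, as requested in the question.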

Related

Is there R codes to organise these data in R? [duplicate]

I would like to remove NA from my data set and then organise them by IDs.
My dataset is similar to this:
df<-read.table (text="ID Name Surname Group A1 A2 A3 Goal Sea
21 Goal Robi A 4 4 4 G No
21 Goal Robi B NA NA NA NA NA
21 Goal Robi C NA NA NA NA NA
21 Goal Robi D 3 4 4 G No
33 Nami Si O NA NA NA NA NA
33 Nami Si P NA NA NA NA NA
33 Nami Si Q 3 4 4 G No
33 Nami Si Z 3 3 3 S No
98 Sara Bat MT 4 4 4 S No
98 Sara Bat NC 4 3 2 D No
98 Sara Bat MF NA NA NA NA NA
98 Sara Bat LC NA NA NA NA NA
66 Noor Shor MF NA NA NA NA NA
66 Noor Shor LC NA NA NA NA NA
66 Noor Shor MT1 4 4 4 G No
66 Noor Shor NC1 2 3 3 D No
", header=TRUE)
After removing the rows that contain NA, I would like to put the remaining rows for each ID side by side on a single line, so that I get this table:
ID Name Surname Group_1 A1 A2 A3 Goal_1 Sea_1 Group_2 A1_1 A2_2 A3_3 Goal_2 Sea_2
21 Goal Robi A 4 4 4 G No D 3 4 4 G No
33 Nami Si Q 3 4 4 G No Z 3 3 3 S No
98 Sara Bat MT 4 4 4 S No NC 4 3 2 D No
66 Noor Shor MT1 4 4 4 G No NC1 2 3 3 D No
Is it possible to get this? It seems it could be done using pivot_longer, but I do not know how.
Look up complete.cases(), which flags the rows that contain no NA:
df <- df[complete.cases(df), ]
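As a small illustration of what complete.cases() does, here is a sketch on a made-up three-row data frame (the values are loosely taken from the question, the object name kept is hypothetical):

```r
# Tiny made-up data frame with one incomplete row.
df <- data.frame(
  ID    = c(21, 21, 33),
  Group = c("A", "B", "Q"),
  A1    = c(4, NA, 3)
)

# complete.cases() returns TRUE for rows without any NA.
keep <- complete.cases(df)
kept <- df[keep, ]
print(kept)
```

This only removes the incomplete rows; putting the remaining rows of each ID on one line still needs a reshaping step.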
A possible solution with the Tidyverse:
df <- structure(list(ID = c(21L, 21L, 21L, 21L, 33L, 33L, 33L, 33L,
98L, 98L, 98L, 98L, 66L, 66L, 66L, 66L), Name = c("Goal", "Goal",
"Goal", "Goal", "Nami", "Nami", "Nami", "Nami", "Sara", "Sara",
"Sara", "Sara", "Noor", "Noor", "Noor", "Noor"), Surname = c("Robi",
"Robi", "Robi", "Robi", "Si", "Si", "Si", "Si", "Bat", "Bat",
"Bat", "Bat", "Shor", "Shor", "Shor", "Shor"), Group = c("A",
"B", "C", "D", "O", "P", "Q", "Z", "MT", "NC", "MF", "LC", "MF",
"LC", "MT1", "NC1"), A1 = c(4L, NA, NA, 3L, NA, NA, 3L, 3L, 4L,
4L, NA, NA, NA, NA, 4L, 2L), A2 = c(4L, NA, NA, 4L, NA, NA, 4L,
3L, 4L, 3L, NA, NA, NA, NA, 4L, 3L), A3 = c(4L, NA, NA, 4L, NA,
NA, 4L, 3L, 4L, 2L, NA, NA, NA, NA, 4L, 3L), Goal = c("G", NA,
NA, "G", NA, NA, "G", "S", "S", "D", NA, NA, NA, NA, "G", "D"
), Sea = c("No", NA, NA, "No", NA, NA, "No", "No", "No", "No",
NA, NA, NA, NA, "No", "No")), class = "data.frame", row.names = c(NA,
-16L))
library(dplyr)
library(tidyr)
new_df <- df %>%
drop_na() %>%
group_by(ID) %>%
mutate(n = row_number()) %>%
pivot_wider(
names_from = n,
values_from= c(Group, A1, A2, A3, Goal, Sea)
) %>%
relocate(ends_with("2"), .after= last_col())
print(new_df)
We can group_by the ID columns and then filter out rows with all NAs in the target columns. Note that slice_head(n=1) then keeps only the first remaining row per ID:
df %>% group_by(ID, Name, Surname) %>%
filter(!if_all(A1:Sea, is.na))%>%
slice_head(n=1)
# A tibble: 4 × 9
# Groups: ID, Name, Surname [4]
ID Name Surname Group A1 A2 A3 Goal Sea
<int> <chr> <chr> <chr> <int> <int> <int> <chr> <chr>
1 21 Goal Robi A 4 4 4 G No
2 33 Nami Si Q 3 4 4 G No
3 66 Noor Shor MT1 4 4 4 G No
4 98 Sara Bat MT 4 4 4 S No

Vectorizing loop operation in R

I have a long-format balanced data frame (df1) that has 7 columns:
df1 <- structure(list(Product_ID = c(1, 1, 1, 1, 1, 2, 2, 2, 2, 2, 3,
3, 3, 3, 3), Product_Category = structure(c(1L, 1L, 1L, 1L, 1L,
2L, 2L, 2L, 2L, 2L, 1L, 1L, 1L, 1L, 1L), .Label = c("A", "B"), class = "factor"),
Manufacture_Date = c(1950, 1950, 1950, 1950, 1950, 1960,
1960, 1960, 1960, 1960, 1940, 1940, 1940, 1940, 1940), Control_Date = c(1961L,
1962L, 1963L, 1964L, 1965L, 1961L, 1962L, 1963L, 1964L, 1965L,
1961L, 1962L, 1963L, 1964L, 1965L), Country_Code = structure(c(1L,
1L, 1L, 1L, 1L, 2L, 2L, 2L, 2L, 2L, 3L, 3L, 3L, 3L, 3L), .Label = c("ABC",
"DEF", "GHI"), class = "factor"), Var1 = c(NA, NA, NA, NA,
NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA), Var2 = c(NA,
NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA)), row.names = c(NA,
15L), class = "data.frame")
Each Product_ID in this data set is linked with a unique Product_Category and Country_Code and Manufacture_Date, and is followed over time (Control_Date). Product_Category has two possible values (A or B); Country_Code and Manufacture_Date have 190 and 90 unique values, respectively. There are 400,000 unique Product_ID's, that are followed over a period of 50 years (Control_Date from 1961 to 2010). This means that df1 has 20,000,000 rows. The last two columns of this data frame are NA at the beginning and have to be filled using the data available in another data frame (df2):
df2 <- structure(list(Product_ID = 1:6, Product_Category = structure(c(1L,
2L, 1L, 1L, 1L, 2L), .Label = c("A", "B"), class = "factor"),
Manufacture_Date = c(1950, 1960, 1940, 1950, 1940, 2000),
Country_Code = structure(c(1L, 2L, 3L, 1L, 2L, 3L), .Label = c("ABC",
"DEF", "GHI"), class = "factor"), Year_1961 = c(5, NA, 10,
NA, 6, NA), Year_1962 = c(NA, NA, 4, 5, 3, NA), Year_1963 = c(8,
6, NA, 5, 6, NA), Year_1964 = c(NA, NA, 9, NA, 10, NA), Year_1965 = c(6,
NA, 7, 4, NA, NA)), row.names = c(NA, 6L), class = "data.frame")
This second data frame contains another type of information on the exact same 400,000 products, in wide-format. Each row represents a unique product (Product_ID) accompanied by its Product_Category, Manufacture_Date and Country_Code. There are 50 other columns (for each year from 1961 to 2010) that contain a measured value (or NA) for each product in each of those years.
Now what I would like to do is to fill in the Var1 & Var2 columns in the first data frame, by doing some calculation on the data available in the second data frame. More precisely, for each row in the first data frame (i.e. a product at Control_Date "t"), the last two columns are defined as follows:
Var1: total number of products in df2 with the same Product_Category, Manufacture_Date and Country_Code that have non-NA value in Year_t;
Var2: total number of products in df2 with different Product_Category but the same Manufacture_Date and Country_Code that have non-NA value in Year_t.
My initial solution with nested for-loops is as follows:
for (i in unique(df1$Product_ID)){
Category <- unique(df1[which(df1$Product_ID==i),"Product_Category"])
Opposite_Category <- ifelse(Category=="A","B","A")
Manufacture <- unique(df1[which(df1$Product_ID==i),"Manufacture_Date"])
Country <- unique(df1[which(df1$Product_ID==i),"Country_Code"])
ID_Similar_Product <- df2[which(df2$Product_Category==Category & df2$Manufacture_Date==Manufacture & df2$Country_Code==Country),"Product_ID"]
ID_Quasi_Similar_Product <- df2[which(df2$Product_Category==Opposite_Category & df2$Manufacture_Date==Manufacture & df2$Country_Code==Country),"Product_ID"]
for (j in unique(df1$Control_Date)){
df1[which(df1$Product_ID==i & df1$Control_Date==j),"Var1"] <- length(which(!is.na(df2[which(df2$Product_ID %in% ID_Similar_Product),paste0("Year_",j)])))
df1[which(df1$Product_ID==i & df1$Control_Date==j),"Var2"] <- length(which(!is.na(df2[which(df2$Product_ID %in% ID_Quasi_Similar_Product),paste0("Year_",j)])))
}
}
The problem with this approach is that it takes a very long time to run. So I would like to know if anybody could suggest a vectorized version that would do the job in less time.
See if this does what you want. I'm using the data.table package since you have a rather large (20M) dataset.
library(data.table)
setDT(df1)
setDT(df2)
# Set keys on the "triplet" to speed up everything
setkey(df1, Product_Category, Manufacture_Date, Country_Code)
setkey(df2, Product_Category, Manufacture_Date, Country_Code)
# Omit the Var1 and Var2 from df1
df1[, c("Var1", "Var2") := NULL]
# Reshape df2 to long form
df2.long <- melt(df2, measure=patterns("^Year_"))
# Split "variable" at the "_" to extract 4-digit year into "Control_Date" and delete leftovers.
df2.long[, c("variable","Control_Date") := tstrsplit(variable, "_", fixed=TRUE)][
, variable := NULL]
# Group by triplet, Var1=count non-NA in value, join with...
# (Group by doublet, N=count non-NA), update Var2=N-Var1.
df2_N <- df2.long[, .(Var1 = sum(!is.na(value))),
by=.(Product_Category, Manufacture_Date, Country_Code)][
df2.long[, .(N = sum(!is.na(value))),
by=.(Manufacture_Date, Country_Code)],
Var2 := N - Var1, on=c("Manufacture_Date", "Country_Code")]
# Update join: df1 with df2_N
df1[df2_N, c("Var1","Var2") := .(i.Var1, i.Var2),
on = .(Product_Category, Manufacture_Date, Country_Code)]
df1
Product_ID Product_Category Manufacture_Date Control_Date Country_Code Var1 Var2
1: 3 A 1940 1961 GHI 4 0
2: 3 A 1940 1962 GHI 4 0
3: 3 A 1940 1963 GHI 4 0
4: 3 A 1940 1964 GHI 4 0
5: 3 A 1940 1965 GHI 4 0
6: 1 A 1950 1961 ABC 6 0
7: 1 A 1950 1962 ABC 6 0
8: 1 A 1950 1963 ABC 6 0
9: 1 A 1950 1964 ABC 6 0
10: 1 A 1950 1965 ABC 6 0
11: 2 B 1960 1961 DEF NA NA
12: 2 B 1960 1962 DEF NA NA
13: 2 B 1960 1963 DEF NA NA
14: 2 B 1960 1964 DEF NA NA
15: 2 B 1960 1965 DEF NA NA
df2
Product_ID Product_Category Manufacture_Date Country_Code Year_1961 Year_1962 Year_1963 Year_1964 Year_1965
1: 5 A 1940 DEF 6 3 6 10 NA
2: 3 A 1940 GHI 10 4 NA 9 7
3: 1 A 1950 ABC 5 NA 8 NA 6
4: 4 A 1950 ABC NA 5 5 NA 4
5: 2 B 1940 DEF NA NA 6 NA NA
6: 6 B 2000 GHI NA NA NA NA NA
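The question defines Var1 and Var2 per Control_Date, so the counts have to be grouped by year as well as by category, manufacture date and country. The following is a sketch of that per-year variant with data.table, using the question's sample data rewritten by hand; the object names long, same and all_cat are made up for illustration, and Var2 is computed as "all categories minus the same category", which assumes there are only the two categories A and B.

```r
library(data.table)

# Sample data from the question (df1 without the empty Var columns).
df1 <- data.table(
  Product_ID       = rep(1:3, each = 5),
  Product_Category = rep(c("A", "B", "A"), each = 5),
  Manufacture_Date = rep(c(1950, 1960, 1940), each = 5),
  Control_Date     = rep(1961:1965, times = 3),
  Country_Code     = rep(c("ABC", "DEF", "GHI"), each = 5)
)
df2 <- data.table(
  Product_ID       = 1:6,
  Product_Category = c("A", "B", "A", "A", "A", "B"),
  Manufacture_Date = c(1950, 1960, 1940, 1950, 1940, 2000),
  Country_Code     = c("ABC", "DEF", "GHI", "ABC", "DEF", "GHI"),
  Year_1961 = c(5, NA, 10, NA, 6, NA),
  Year_1962 = c(NA, NA, 4, 5, 3, NA),
  Year_1963 = c(8, 6, NA, 5, 6, NA),
  Year_1964 = c(NA, NA, 9, NA, 10, NA),
  Year_1965 = c(6, NA, 7, 4, NA, NA)
)

# Reshape df2 to long form and turn "Year_1961" into the integer 1961.
long <- melt(df2,
             measure.vars  = grep("^Year_", names(df2), value = TRUE),
             variable.name = "Control_Date", value.name = "value")
long[, Control_Date := as.integer(sub("^Year_", "", as.character(Control_Date)))]

# Non-NA counts per category / date / country / year ...
same <- long[, .(Var1 = sum(!is.na(value))),
             by = .(Product_Category, Manufacture_Date, Country_Code, Control_Date)]
# ... and per date / country / year over all categories.
all_cat <- long[, .(N = sum(!is.na(value))),
                by = .(Manufacture_Date, Country_Code, Control_Date)]
same[all_cat, Var2 := i.N - Var1,
     on = .(Manufacture_Date, Country_Code, Control_Date)]

# Update join: copy both counts into df1, matching on the year as well.
df1[same, c("Var1", "Var2") := .(i.Var1, i.Var2),
    on = .(Product_Category, Manufacture_Date, Country_Code, Control_Date)]
print(df1)
```

With this sample, Product_ID 1 gets Var1 = 1, 1, 2, 0, 2 for 1961 to 1965, because products 1 and 4 share its category, manufacture date and country.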

How to remove NA in character data in R

I would like to copy the last two columns from each month to the beginning of the next month. I did it as follows (below), but the data contains NA and when I change it to character, the program breaks down. How do I copy columns to keep their type?
My code:
library(readxl)
library(tibble)
df<- read_excel("C:/Users/Rezerwa/Documents/Database.xlsx")
df=add_column(df, Feb1 = as.character(do.call(paste0, df["January...4"])), .after = "January...5")
df=add_column(df, Feb2 = as.numeric(do.call(paste0, df["January...5"])), .after = "Feb1")
My data:
df
# A tibble: 10 x 13
Product January...2 January...3 January...4 January...5 February...6 February...7 February...8 February...9 March...10 March...11 March...12 March...13
<chr> <lgl> <lgl> <chr> <dbl> <chr> <dbl> <chr> <dbl> <chr> <dbl> <chr> <dbl>
1 a NA NA 754.00 4 754.00 4 754.00 4 754.00 4 754.00 4
2 b NA NA 706.00 3 706.00 3 706.00 3 706.00 3 706.00 3
3 c NA NA 517.00 3 517.00 3 517.00 3 517.00 3 517.00 3
4 d NA NA 1466.00 9 1466.00 9 1466.00 9 1466.00 9 1466.00 9
5 e NA NA 543.00 8 543.00 8 543.00 8 543.00 8 543.00 8
6 f NA NA NA NA NA NA NA NA NA NA NA NA
7 g NA NA NA NA NA NA NA NA NA NA NA NA
8 h NA NA NA NA NA NA NA NA NA NA NA NA
9 i NA NA 1466.00 8 NA NA NA NA NA NA NA NA
10 j NA NA NA NA 543.00 3 NA NA NA NA NA NA
My error:
> df=add_column(df, Feb1 = as.character(do.call(paste0, df["January...4"])), .after = "January...5")
> df=add_column(df, Feb2 = as.numeric(do.call(paste0, df["January...5"])), .after = "Feb1")
Warning message:
In eval_tidy(xs[[i]], unique_output) : NAs introduced by coercion
Using base R we can split the columns based on the prefix of their names, select last two columns from each group and cbind to original df.
df1 <- cbind(df, do.call(cbind, lapply(split.default(df[-1],
sub("\\..*", "", names(df)[-1])), function(x) {n <- ncol(x);x[, c(n-1, n)]})))
To get data in order, we can do
cbind(df1[1], df1[-1][order(match(sub("\\..*", "", names(df1)[-1]), month.name))])
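The warning in the question arises because paste0() turns a missing value into the text "NA", which as.numeric() then cannot parse. A sketch of the difference, on a made-up two-row data frame with hypothetical column names Jan_value and Jan_count, assuming the goal is simply to copy a column and keep its type:

```r
# Made-up stand-in for two of the January columns.
df <- data.frame(
  Product   = c("a", "f"),
  Jan_value = c("754.00", NA),
  Jan_count = c(4, NA),
  stringsAsFactors = FALSE
)

# Going through paste0() converts the missing value into the string "NA".
pasted <- do.call(paste0, df["Jan_count"])
print(pasted)

# Copying the column directly keeps both the type and the real NA.
df$Feb1 <- df[["Jan_value"]]
df$Feb2 <- df[["Jan_count"]]
str(df)
```

The same direct copy should also work inside add_column(), for example Feb1 = df[["January...4"]], without any paste0() or as.character() step.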
data
df <- structure(list(Product = structure(1:10, .Label = c("a", "b",
"c", "d", "e", "f", "g", "h", "i", "j"), class = "factor"), January...2 = c(NA,
NA, NA, NA, NA, NA, NA, NA, NA, NA), January...3 = c(NA, NA,
NA, NA, NA, NA, NA, NA, NA, NA), January...4 = c(754, 706, 517,
1466, 543, NA, NA, NA, 1466, NA), January...5 = c(4L, 3L, 3L,
9L, 8L, NA, NA, NA, 8L, NA), February...6 = c(754, 706, 517,
1466, 543, NA, NA, NA, NA, 543), February...7 = c(4L, 3L, 3L,
9L, 8L, NA, NA, NA, NA, 3L), February...8 = c(754, 706, 517,
1466, 543, NA, NA, NA, NA, NA), February...9 = c(4L, 3L, 3L,
9L, 8L, NA, NA, NA, NA, NA), March...10 = c(754, 706, 517, 1466,
543, NA, NA, NA, NA, NA), March...11 = c(4L, 3L, 3L, 9L, 8L,
NA, NA, NA, NA, NA), March...12 = c(754, 706, 517, 1466, 543,
NA, NA, NA, NA, NA), March...13 = c(4L, 3L, 3L, 9L, 8L, NA, NA,
NA, NA, NA)), class = "data.frame", row.names = c("1", "2", "3",
"4", "5", "6", "7", "8", "9", "10"))

Add value from previous row under conditions

I have a data frame called data, and I would like to add a new column holding a value taken from the previous row of another column, provided the Id is the same.
Here is a sample:
data <- structure(list(Id = c("a", "b", "b", "b", "a", "a", "b", "b",
"a", "a"), duration.minutes = c(NA, 139L, 535L, 150L, NA, NA,
145L, 545L, 144L, NA), event = structure(c(1L, 4L, 3L, 4L, 2L,
1L, 4L, 3L, 4L, 2L), .Label = c("enter", "exit", "stop", "trip"
), class = "factor")), .Names = c("Id", "duration.minutes", "event"
), class = "data.frame", row.names = 265:274)
and I would like to add a new column called "duration.minutes.past" like this:
data <- structure(list(Id = c("a", "b", "b", "b", "a", "a", "b", "b",
"a", "a"), duration.minutes = c(NA, 139L, 535L, 150L, NA, NA,
145L, 545L, 144L, NA), event = structure(c(1L, 4L, 3L, 4L, 2L,
1L, 4L, 3L, 4L, 2L), .Label = c("enter", "exit", "stop", "trip"
), class = "factor"), duration.minutes.past = c(NA, NA, 139,
NA, NA, NA, NA, 145, NA, NA)), .Names = c("Id", "duration.minutes",
"event", "duration.minutes.past"), row.names = 265:274, class = "data.frame")
As you can see, the new column duration.minutes.past holds the duration.minutes of the previous trip for the same Id. If the Id is different or the event is not a stop, then the value of duration.minutes.past is NA.
Help is much appreciated!
A possible solution using dplyr,
library(dplyr)
data %>%
group_by(Id) %>%
mutate(new = replace(lag(duration.minutes), event != 'stop', NA))
#Source: local data frame [10 x 4]
#Groups: Id [2]
# Id duration.minutes event new
# <chr> <int> <fctr> <int>
#1 a NA enter NA
#2 b 139 trip NA
#3 b 535 stop 139
#4 b 150 trip NA
#5 a NA exit NA
#6 a NA enter NA
#7 b 145 trip NA
#8 b 545 stop 145
#9 a 144 trip NA
#10 a NA exit NA
We can do this with data.table. Convert the 'data.frame' to a 'data.table' (setDT(data)); grouped by 'Id', create the lag column of 'duration.minutes' using shift, then change the value to NA where 'event' is not equal to 'stop'.
library(data.table)
setDT(data)[, duration.minutes.past := shift(duration.minutes),
Id][event != "stop", duration.minutes.past := NA][]
data
# Id duration.minutes event duration.minutes.past
#1: a NA enter NA
#2: b 139 trip NA
#3: b 535 stop 139
#4: b 150 trip NA
#5: a NA exit NA
#6: a NA enter NA
#7: b 145 trip NA
#8: b 545 stop 145
#9: a 144 trip NA
#10: a NA exit NA
Or this can be done with base R using ave
data$duration.minutes.past <- with(data, NA^(event != "stop") *
ave(duration.minutes, Id, FUN = function(x) c(NA, x[-length(x)])))
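The base R line relies on the fact that in R NA^0 is 1 while NA^1 is NA, so NA^(condition) gives a multiplier that is NA where the condition is TRUE and 1 where it is FALSE. A small sketch on three made-up rows of a single Id:

```r
# Three consecutive rows of one Id (values as in the question's sample).
event    <- c("trip", "stop", "trip")
duration <- c(139L, 535L, 150L)

# Lag by one position, as the function passed to ave() does within each Id.
lagged <- c(NA, duration[-length(duration)])

# NA where the event is not "stop", 1 where it is.
mask <- NA^(event != "stop")

result <- mask * lagged
print(result)
```

Only the "stop" row keeps the lagged value (139); every other row becomes NA.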

R: Replacing a factor with an integer value in numerous cells across numerous columns

So, my challenge has been to convert a raw scale csv to a scored csv. Within numerous columns, the file has cells filled with "Strongly Agree" to "Strongly Disagree", 6 levels. These factors need to be converted into the integers 5 to 0 respectively.
I have tried unsuccessfully to use sapply and convert the table to a string. Sapply works on the vector, but it destroys the table structure.
Method 1:
dat$Col<-sapply(dat$Col,switch,'Strongly Disagree'=0,'Disagree'=1,'Slightly Disagree'=2,'Slightly Agree'=3,'Agree'=4, 'Strongly Agree'=5)
My second approach is to convert the csv into a string. When I examined the dput output, I saw the area I wanted to target that started with a .Label="","Strongly Agree"... Mistake. My changes did not result in a useful outcome.
My third approach came from the internet gods of destruction who seemed to express that gsub() might handle the string approach as well. Nope, again the underlying table structure was destroyed.
Method #3: Convert into a string and pattern match
dat <- textConnection("control/Surveys/StudyDat_1.csv")
#Score Scales
##"Strongly Agree"= 5
##"Agree"= 4
##"Strongly Disagree" = 0
#levels(dat$Col) <- gsub("Strongly Agree", "5", levels(dat$Col))
df<- gsub("Strongly Agree", "5",dat)
dat<-read.csv(textConnection(df),header=TRUE)
In the end, I am wanting to replace ALL "Strongly Agree" to 5 across numerous columns without the consequence of destroying the retrievability of the data.
Maybe I used the wrong search string and you know the resource I need to address this problem. I would rather avoid ALL character vector approaches, as they would require labeling each column if you provide a code response. It will need to go across ALL COLUMNS.
Thanks
Data Sample Problem
structure(list(last_updated = structure(c(3L, 1L, 7L, 2L, 10L, 6L, 8L, 9L, 7L, 5L, 4L), .Label = c("2016-05-13T12:53:56.704184Z",
"2016-05-13T12:54:09.273359Z", "2016-05-13T12:54:22.757251Z",
"2016-05-14T12:44:13.474992Z", "2016-05-14T12:44:31.736469Z",
"2016-05-16T16:45:10.623410Z", "2016-05-16T16:46:17.881402Z",
"2016-05-16T16:46:55.122257Z", "2016-05-16T16:47:14.160793Z",
"2016-05-24T02:26:04.770799Z"), class = "factor"), feedback = c(NA,
NA, NA, NA, NA, NA, NA, NA, NA, NA, NA), A = structure(c(NA,
NA, 2L, NA, 1L, NA, NA, NA, 2L, NA, NA), .Label = c("", "Slightly Disagree"
), class = "factor"), B = structure(c(NA, NA, 2L, NA, 1L, NA,
NA, NA, 3L, NA, NA), .Label = c("", "Disagree", "Strongly Agree"
), class = "factor"), C = structure(c(NA, NA, 2L, NA, 1L, NA,
NA, NA, 3L, NA, NA), .Label = c("", "Agree", "Disagree"), class = "factor"),
D = structure(c(NA, NA, 2L, NA, 1L, NA, NA, NA, 2L, NA, NA
), .Label = c("", "Agree"), class = "factor"), E = structure(c(NA,
NA, 2L, NA, 1L, NA, NA, NA, 3L, NA, NA), .Label = c("", "Agree",
"Strongly Disagree"), class = "factor")), .Names = c("last_updated",
"feedback", "A", "B", "C", "D", "E"), class = "data.frame", row.names = c(NA,
-11L))
Data Sample Solution
df<-dget(structure(list(last_updated = structure(c(3L, 1L, 7L, 2L, 10L, 6L,8L, 9L, 7L, 5L, 4L), .Label = c("2016-05-13T12:53:56.704184Z",
"2016-05-13T12:54:09.273359Z", "2016-05-13T12:54:22.757251Z",
"2016-05-14T12:44:13.474992Z", "2016-05-14T12:44:31.736469Z",
"2016-05-16T16:45:10.623410Z", "2016-05-16T16:46:17.881402Z",
"2016-05-16T16:46:55.122257Z", "2016-05-16T16:47:14.160793Z",
"2016-05-24T02:26:04.770799Z"), class = "factor"), feedback = c(NA,
NA, NA, NA, NA, NA, NA, NA, NA, NA, NA), A = c(NA, NA, 2L, NA,
NA, NA, NA, NA, 2L, NA, NA), B = c(NA, NA, 1L, NA, NA, NA, NA,
NA, 5L, NA, NA), C = c(NA, NA, 4L, NA, NA, NA, NA, NA, 1L, NA,
NA), D = c(NA, NA, 4L, NA, NA, NA, NA, NA, 4L, NA, NA), E = c(NA,
NA, 4L, NA, NA, NA, NA, NA, 0L, NA, NA)), .Names = c("last_updated",
"feedback", "A", "B", "C", "D", "E"), class = "data.frame", row.names = c(NA,-11L)))
We can use factor with the levels specified:
nm1 <- c('Strongly Disagree', 'Disagree',
'Slightly Disagree','Slightly Agree','Agree', 'Strongly Agree')
factor(dat$col, levels = nm1,
labels = 0:5)
If there are multiple factor columns with the same levels, identify the factor columns ('i1'), loop through it with lapply and specify the levels and labels.
i1 <- sapply(dat, is.factor)
dat[i1] <- lapply(dat[i1], factor, levels = nm1, labels= 0:5)
Update
Using the OP's dput output
dat[-(1:2)] <- lapply(dat[-(1:2)], factor, levels = nm1, labels = 0:5)
dat
# last_updated feedback A B C D E
#1 2016-05-13T12:54:22.757251Z NA <NA> <NA> <NA> <NA> <NA>
#2 2016-05-13T12:53:56.704184Z NA <NA> <NA> <NA> <NA> <NA>
#3 2016-05-16T16:46:17.881402Z NA 2 1 4 4 4
#4 2016-05-13T12:54:09.273359Z NA <NA> <NA> <NA> <NA> <NA>
#5 2016-05-24T02:26:04.770799Z NA <NA> <NA> <NA> <NA> <NA>
#6 2016-05-16T16:45:10.623410Z NA <NA> <NA> <NA> <NA> <NA>
#7 2016-05-16T16:46:55.122257Z NA <NA> <NA> <NA> <NA> <NA>
#8 2016-05-16T16:47:14.160793Z NA <NA> <NA> <NA> <NA> <NA>
#9 2016-05-16T16:46:17.881402Z NA 2 5 1 4 0
#10 2016-05-14T12:44:31.736469Z NA <NA> <NA> <NA> <NA> <NA>
#11 2016-05-14T12:44:13.474992Z NA <NA> <NA> <NA> <NA> <NA>
Another option is set from data.table
library(data.table)
for(j in names(dat)[-(1:2)]){
set(dat, i = NULL, j= j, value = factor(dat[[j]], levels = nm1, labels = 0:5))
}
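Note that factor(..., labels = 0:5) still returns a factor whose labels happen to look like numbers. If real integers are needed, one option is to go through as.character() first; the sketch below shows this on a made-up four-element vector (x and scored are illustrative names).

```r
nm1 <- c("Strongly Disagree", "Disagree", "Slightly Disagree",
         "Slightly Agree", "Agree", "Strongly Agree")

# Made-up column with an empty string and an NA, as in the sample data.
x <- factor(c("Agree", "", "Strongly Disagree", NA))

scored <- factor(x, levels = nm1, labels = 0:5)
print(is.factor(scored))

# as.integer() on a factor returns the internal codes 1..6, not the labels.
codes <- as.integer(scored)

# Converting via character gives the scores 0..5.
scores <- as.integer(as.character(scored))
print(scores)
```

Values that are not among the six levels, such as the empty string, become NA in both versions.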
I would just match each target column vector against a precomputed character vector to get an integer index. You can subtract 1 afterward to change the range from 1:6 to 0:5.
## define desired value order, ascending
o <- c(
'Strongly Disagree',
'Disagree',
'Slightly Disagree',
'Slightly Agree',
'Agree',
'Strongly Agree'
);
## convert target columns
for (cn in names(df)[-(1:2)]) df[[cn]] <- match(as.character(df[[cn]]),o)-1L;
df;
## last_updated feedback A B C D E
## 1 2016-05-13T12:54:22.757251Z NA NA NA NA NA NA
## 2 2016-05-13T12:53:56.704184Z NA NA NA NA NA NA
## 3 2016-05-16T16:46:17.881402Z NA 2 1 4 4 4
## 4 2016-05-13T12:54:09.273359Z NA NA NA NA NA NA
## 5 2016-05-24T02:26:04.770799Z NA NA NA NA NA NA
## 6 2016-05-16T16:45:10.623410Z NA NA NA NA NA NA
## 7 2016-05-16T16:46:55.122257Z NA NA NA NA NA NA
## 8 2016-05-16T16:47:14.160793Z NA NA NA NA NA NA
## 9 2016-05-16T16:46:17.881402Z NA 2 5 1 4 0
## 10 2016-05-14T12:44:31.736469Z NA NA NA NA NA NA
## 11 2016-05-14T12:44:13.474992Z NA NA NA NA NA NA
Previous answers might meet your needs, but note that changing the labels of a factor isn't the same as changing a factor to an integer variable. One possibility would be to use ifelse (I've made a new data frame as the one you posted didn't actually have variables with these levels in it):
lev <- c('Strongly disagree', 'Disagree', 'Slightly disagree', 'Slightly agree', 'Agree', 'Strongly agree')
dta <- sample(lev, 55, replace = TRUE)
dta <- data.frame(matrix(dta, nrow = 11), stringsAsFactors = TRUE)  # factors are no longer the default since R 4.0
names(dta) <- LETTERS[1:5]
f_to_int <- function(f) {
if (is.factor(f)){
ifelse(f == 'Strongly disagree', 0,
ifelse(f == 'Disagree', 1,
ifelse(f == 'Slightly disagree', 2,
ifelse(f == 'Slightly agree', 3,
ifelse(f == 'Agree', 4,
ifelse(f == 'Strongly agree', 5, f))))))
} else f
}
dta2 <- sapply(dta, f_to_int)
Note that this returns a matrix, but it is easily converted to a data frame if necessary.
