I have a table with 5 columns (ID, var, state, loc and position). The var column contains a description of a certain variant, e.g. var1. Within the table there are multiple rows for var1, but they have different states and positions. What I want to do is make a new table where each var is included only once and the position is split into two columns based on its state.
For example, say I have four var1 rows: two with the state H and two with the state h. In the new table I need the columns to be sample (ID), var, loc, position if H and position if h, such that all the information for var1 is in one row. I need to be able to do this for every variant in my original data set.
Current data example
structure(list(ID = c(1234L, 1234L, 1234L, 1234L, 5678L, 5678L,
NA, NA, NA, NA), var = c("var1", "var1", "var1", "var1", "var2",
"var2", NA, NA, NA, NA), state = c("H", "H", "h", "h", "H", "h",
NA, NA, NA, NA), loc = c(4L, 4L, 4L, 4L, 12L, 12L, NA, NA, NA,
NA), position = c(6000L, 6002L, 6004L, 6006L, 3002L, 3004L, NA,
NA, NA, NA)), row.names = c("1", "2", "3", "4", "5", "6", "NA",
"NA.1", "NA.2", "NA.3"), class = "data.frame")
wanted format
structure(list(V1 = c("ID", "1234", "5678", NA, NA, NA, NA, NA,
NA, NA), V2 = c("var1", "var1", "var2", NA, NA, NA, NA, NA, NA,
NA), V3 = c("loc", "4", "12", NA, NA, NA, NA, NA, NA, NA), V4 = c("state H",
"6000 6002", "3002", NA, NA, NA, NA, NA, NA, NA), V5 = c("state h",
"6004 6006", "3004", NA, NA, NA, NA, NA, NA, NA)), row.names = c("1",
"2", "3", "NA", "NA.1", "NA.2", "NA.3", "NA.4", "NA.5", "NA.6"
), class = "data.frame")
Any guidance would be appreciated.
The answer to your question most likely involves tidyr::pivot_wider.
I changed the example data because I believe yours was inconsistent (it contained trailing all-NA rows).
Data
df<-structure(list(ID = c(1234L, 1234L, 1234L, 1234L, 5678L, 5678L
), var = c("var1", "var1", "var1", "var1", "var2", "var2"), state = c("H",
"H", "h", "h", "H", "h"), loc = c(4L, 4L, 4L, 4L, 12L, 12L),
position = c(6000L, 6002L, 6004L, 6006L, 3002L, 3004L)), row.names = c("1",
"2", "3", "4", "5", "6"), class = "data.frame")
df
ID var state loc position
1 1234 var1 H 4 6000
2 1234 var1 H 4 6002
3 1234 var1 h 4 6004
4 1234 var1 h 4 6006
5 5678 var2 H 12 3002
6 5678 var2 h 12 3004
Answer
library(tidyr)
df %>% pivot_wider(names_from = state,
values_from = position,
values_fn = toString)
# A tibble: 2 × 5
ID var loc H h
<int> <chr> <int> <chr> <chr>
1 1234 var1 4 6000, 6002 6004, 6006
2 5678 var2 12 3002 3004
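If tidyr is not available, roughly the same reshaping can be sketched in base R. This is only an illustration under the assumption that positions should be joined with a space (as in the asker's wanted format); it swaps pivot_wider for aggregate() followed by reshape(), and the names collapsed and wide are made up for the example.

```r
# Same example data as in the answer above
df <- data.frame(
  ID = c(1234L, 1234L, 1234L, 1234L, 5678L, 5678L),
  var = c("var1", "var1", "var1", "var1", "var2", "var2"),
  state = c("H", "H", "h", "h", "H", "h"),
  loc = c(4L, 4L, 4L, 4L, 12L, 12L),
  position = c(6000L, 6002L, 6004L, 6006L, 3002L, 3004L)
)

# Step 1: collapse all positions that share ID, var, loc and state
# into one space-separated string
collapsed <- aggregate(
  position ~ ID + var + loc + state,
  data = df,
  FUN = function(x) paste(x, collapse = " ")
)

# Step 2: spread the state values into separate columns
# (the new columns are named position.H and position.h)
wide <- reshape(
  collapsed,
  idvar = c("ID", "var", "loc"),
  timevar = "state",
  direction = "wide"
)

print(wide)
```

The order of the position.H and position.h columns can depend on the locale's sort order, so it is safer to refer to them by name than by index.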
I have a data.frame, in this format:
A w x y z
0.23 1 NA NA NA
0.12 NA 2 NA NA
0.45 NA 2 NA NA
0.89 NA NA 3 NA
0.12 NA NA NA 4
And I want to collapse w:x:y:z into a single column, while removing NA's. Desired result:
A Comb
0.23 1
0.12 2
0.45 2
0.89 3
0.12 4
My approach so far is:
df %>% unite("Comb", w:x:y:z, na.rm=TRUE, remove=TRUE)
However, "Comb" is being populated with strings such as 1_NA_NA_NA and NA_NA_NA_4 i.e. it is not removing the NA's. I've tried switching to character NA's, but that leads to bizarre and unpredictable results. What am I doing wrong?
I'd also like to be able to do this when the original data.frame is populated with strings (in place of the numbers). Is there a method for this?
Using dplyr::coalesce we can do the following:
df %>%
mutate(Comb = coalesce(w,x,y,z)) %>%
select(A, Comb)
which gives the following output:
A Comb
<dbl> <dbl>
1 0.23 1
2 0.12 2
3 0.45 2
4 0.89 3
5 0.12 4
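The question also asks about columns that hold strings instead of numbers. The coalescing idea itself does not depend on the column type; a minimal base R sketch, assuming character columns and using a hypothetical helper named coalesce_cols (not part of dplyr), could look like this:

```r
# Example data with strings in place of the numbers (made up for illustration)
df <- data.frame(
  A = c(0.23, 0.12, 0.45, 0.89, 0.12),
  w = c("a", NA, NA, NA, NA),
  x = c(NA, "b", "b", NA, NA),
  y = c(NA, NA, NA, "c", NA),
  z = c(NA, NA, NA, NA, "d"),
  stringsAsFactors = FALSE
)

# Hypothetical helper: walk through the vectors left to right and
# keep the first non-NA value found at each position
coalesce_cols <- function(...) {
  Reduce(function(a, b) ifelse(is.na(a), b, a), list(...))
}

df$Comb <- coalesce_cols(df$w, df$x, df$y, df$z)
result <- df[c("A", "Comb")]
print(result)
```

The same helper works on numeric columns; it only assumes that all inputs have the same length and a compatible type.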
In unite, na.rm does not remove NAs from integer/factor columns.
Convert the columns to character first and then use unite.
library(dplyr)
df %>%
mutate_at(vars(w:z), as.character) %>%
tidyr::unite('comb', w:z, na.rm = TRUE)
# A comb
#1 0.23 1
#2 0.12 2
#3 0.45 2
#4 0.89 3
#5 0.12 4
data
df <- structure(list(A = c(0.23, 0.12, 0.45, 0.89, 0.12), w = c(1L,
NA, NA, NA, NA), x = c(NA, 2L, 2L, NA, NA), y = c(NA, NA, NA,
3L, NA), z = c(NA, NA, NA, NA, 4L)), class = "data.frame",
row.names = c(NA, -5L))
Another option is fcoalesce from data.table
library(data.table)
setDT(df)[, .(A, Comb = fcoalesce(w, x, y, z))]
data
df <- structure(list(A = c(0.23, 0.12, 0.45, 0.89, 0.12), w = c(1L,
NA, NA, NA, NA), x = c(NA, 2L, 2L, NA, NA), y = c(NA, NA, NA,
3L, NA), z = c(NA, NA, NA, NA, 4L)), class = "data.frame",
row.names = c(NA, -5L))
Using na.omit.
dat <- transform(dat[1], Comb=apply(dat[-1], 1, na.omit))
# A Comb
# 1 0.23 1
# 2 0.12 2
# 3 0.45 2
# 4 0.89 3
# 5 0.12 4
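The na.omit approach relies on every row having exactly one non-NA value. If a row could be entirely NA, apply() would no longer return a plain vector. A hedged sketch of a more defensive variant, using a hypothetical helper first_non_na and a small made-up data frame:

```r
# Small example where the last row has no value at all
dat <- data.frame(
  A = c(0.23, 0.12, 0.45),
  w = c(1L, NA, NA),
  x = c(NA, 2L, NA)
)

# Hypothetical helper: first non-NA element of a row, or NA if there is none
first_non_na <- function(row) {
  keep <- row[!is.na(row)]
  if (length(keep) == 0) NA_integer_ else keep[[1]]
}

# Apply the helper row by row to all columns except A
dat$Comb <- unname(apply(as.matrix(dat[-1]), 1, first_non_na))
print(dat[c("A", "Comb")])
```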
Data
dat <- structure(list(A = c(0.23, 0.12, 0.45, 0.89, 0.12), w = c(1L,
NA, NA, NA, NA), x = c(NA, 2L, 2L, NA, NA), y = c(NA, NA, NA,
3L, NA), z = c(NA, NA, NA, NA, 4L)), row.names = c(NA, -5L), class = "data.frame")
I have variables with the names "VA01_01", "VA01_02" etc. and "VA02_01", "VA02_02". Those variables with the prefix VA01 are data from female participants, those with the prefix VA02 are from male participants. Male participants, for example, have NAs in the variables VA01. I already have a factor with values for sex.
What I'd like to do is create a new set of variables that take over the values from both variable types. That is, if it's a male participant, he gets the values of the VA02 variables in that set of variables. So the new set of variables won't have any NAs any more because it won't be based on sex.
Does anyone have a simple solution for that question? I don't know if reshape is the answer because I don't really want to transform my data frame into long format.
Here is how it looks at the beginning:
structure(list(sex = structure(c(1L, 2L, 1L, 2L), .Label = c("female",
"male"), class = "factor"), VA01_01 = c(1, NA, 2, NA), VA01_02 = c(4,
NA, 4, NA), VA02_01 = c(NA, 3, NA, 4), VA02_02 = c(NA, 5, NA,
3)), .Names = c("sex", "VA01_01", "VA01_02", "VA02_01", "VA02_02"
), row.names = c(NA, -4L), class = "data.frame")
And here at the end (I'd like to keep the original variables):
structure(list(sex = structure(c(1L, 2L, 1L, 2L), .Label = c("female",
"male"), class = "factor"), VA_tot_01 = c(1, 3, 2, 4), VA_tot_02 = c(4,
5, 4, 3), VA01_01 = c(1, NA, 2, NA), VA01_02 = c(4, NA, 4, NA
), VA02_01 = c(NA, 3, NA, 4), VA02_02 = c(NA, 5, NA, 3)), .Names = c("sex",
"VA_tot_01", "VA_tot_02", "VA01_01", "VA01_02", "VA02_01", "VA02_02"
), row.names = c(NA, -4L), class = "data.frame")
Considering that the VA01 and VA02 variables don't overlap, you could simply create new variables VA_tot_xx that take the original values from both. It would be something like this:
new_vars <- function(df) {
  # Collect the unique item suffixes ("_01", "_02", ...) from the column names
  vars <- unique(gsub(
    pattern = ".*_",
    replacement = "_",
    x = grep(
      pattern = "_[0-9]{2}$",
      x = names(df),
      value = TRUE
    )
  ))
  for (i in vars) {
    new_name <- paste0("VA_tot", i)
    female_name <- paste0("VA01", i)  # VA01_* holds the female participants' values
    male_name <- paste0("VA02", i)    # VA02_* holds the male participants' values
    df[[new_name]] <- NA
    # Copy whichever of the two source columns is not NA
    df[[new_name]][!is.na(df[[female_name]])] <-
      df[[female_name]][!is.na(df[[female_name]])]
    df[[new_name]][!is.na(df[[male_name]])] <-
      df[[male_name]][!is.na(df[[male_name]])]
  }
  return(df)
}
It could probably get prettier than this, but this does the job.
c <- structure(
list(
sex = structure(
c(1L, 2L, 1L, 2L),
.Label = c("female", "male"),
class = "factor"
),
VA01_01 = c(1, NA, 2, NA),
VA01_02 = c(4, NA, 4, NA),
VA02_01 = c(NA, 3, NA, 4),
VA02_02 = c(NA, 5, NA, 3)
),
.Names = c("sex", "VA01_01", "VA01_02", "VA02_01", "VA02_02"),
row.names = c(NA, -4L),
class = "data.frame"
)
new_vars(c)
# sex VA01_01 VA01_02 VA02_01 VA02_02 VA_tot_01 VA_tot_02
# 1 female 1 4 NA NA 1 4
# 2 male NA NA 3 5 3 5
# 3 female 2 4 NA NA 2 4
# 4 male NA NA 4 3 4 3
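Since a sex factor already exists, the new columns could also be chosen explicitly by sex rather than by checking for NA. The following is a minimal sketch of that idea, assuming the column naming from the question (VA01_xx for female, VA02_xx for male); the variable suffixes is introduced only for this example.

```r
df <- data.frame(
  sex = factor(c("female", "male", "female", "male")),
  VA01_01 = c(1, NA, 2, NA),
  VA01_02 = c(4, NA, 4, NA),
  VA02_01 = c(NA, 3, NA, 4),
  VA02_02 = c(NA, 5, NA, 3)
)

# Item suffixes ("01", "02", ...) taken from the female columns
suffixes <- sub("^VA01_", "", grep("^VA01_", names(df), value = TRUE))

for (s in suffixes) {
  # Female rows take the VA01 value, all other rows take the VA02 value
  df[[paste0("VA_tot_", s)]] <- ifelse(
    df$sex == "female",
    df[[paste0("VA01_", s)]],
    df[[paste0("VA02_", s)]]
  )
}

print(df)
```

This keeps the original variables and makes the dependence on sex visible, at the cost of assuming that sex is never NA.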
I'm trying to build a grouped bar chart in R. I have pasted the data frame below. I have been using plotly to build the chart. The problem is that the numbers on the y-axis are not in ascending order. I've also posted an image of the resulting graph.
Can someone please point out, where I'm going wrong?
Dataframe
chart.supp.part.defect.matrix
Supplier PaintMarking45 Seal78 AirConditioning57 Engine34 CargoCompartment543 Insulation11
1 HJRU 8 <NA> <NA> 1 <NA> <NA>
2 DJDU <NA> 1 <NA> <NA> <NA> <NA>
3 DEF7 <NA> 3 54 <NA> <NA> <NA>
4 A23 <NA> <NA> <NA> 7 <NA> <NA>
5 A52 3 <NA> <NA> <NA> 2 <NA>
6 FJUE 65 <NA> 1 <NA> <NA> 11
7 A31 <NA> 1 5 <NA> <NA> <NA>
8 DJHD <NA> <NA> <NA> <NA> <NA> <NA>
9 A38 4 <NA> 22 <NA> <NA> <NA>
Code to build chart
title <- paste( "Supplier vs Defect")
p3 <- plot_ly(chart.supp.part.defect.matrix, x = ~Supplier, y = ~PaintMarking45, type = 'bar', name = 'Paint/Marking-45') %>%
add_trace(y = ~Seal78,name = 'Seal-78') %>%
add_trace(y = ~AirConditioning57,name = 'Air conditioning - 57') %>%
add_trace(y = ~Engine34,name = 'Engine-34') %>%
add_trace(y = ~CargoCompartment543,name = 'Cargo compartment-543') %>%
add_trace(y = ~Insulation11 ,name = 'Insulation -11') %>%
add_trace(y = ~Insulation6,name = 'Insulation-6') %>%
add_trace(y = ~Engine11,name = 'Engine-11') %>%
add_trace(y = ~Propulsion32,name = 'Propulsion-32') %>%
layout(yaxis = list(title = 'Defect Count'), barmode = 'group') %>%
layout(title = title)
ggplotly(p3)
Chart
Edit
dput(chart.supp.part.defect.matrix)
structure(list(Supplier = structure(c(9L, 6L, 5L, 1L, 4L, 8L,
2L, 7L, 3L), .Label = c(" A23", " A31", " A38", " A52", " DEF7",
"DJDU", "DJHD", "FJUE", "HJRU"), class = "factor"), PaintMarking45 = structure(c(4L,
NA, NA, NA, 1L, 3L, NA, NA, 2L), .Label = c("3", "4", "65", "8"
), class = "factor"), Seal78 = structure(c(NA, 1L, 2L, NA, NA,
NA, 1L, NA, NA), .Label = c("1", "3"), class = "factor"), AirConditioning57 = structure(c(NA,
NA, 4L, NA, NA, 1L, 3L, NA, 2L), .Label = c("1", "22", "5", "54"
), class = "factor"), Engine34 = structure(c(1L, NA, NA, 2L,
NA, NA, NA, NA, NA), .Label = c("1", "7"), class = "factor"),
CargoCompartment543 = structure(c(NA, NA, NA, NA, 1L, NA,
NA, NA, NA), .Label = "2", class = "factor"), Insulation11 = structure(c(NA,
NA, NA, NA, NA, 1L, NA, NA, NA), .Label = "11", class = "factor"),
Insulation6 = structure(c(NA, NA, NA, NA, NA, NA, 1L, NA,
NA), .Label = "7", class = "factor"), Engine11 = structure(c(NA,
NA, NA, NA, NA, NA, 2L, 1L, NA), .Label = c("54", "8"), class = "factor"),
Propulsion32 = structure(c(NA, NA, NA, NA, NA, NA, NA, NA,
1L), .Label = "2", class = "factor")), .Names = c("Supplier",
"PaintMarking45", "Seal78", "AirConditioning57", "Engine34",
"CargoCompartment543", "Insulation11", "Insulation6", "Engine11",
"Propulsion32"), row.names = c(NA, -9L), class = "data.frame")
In addition to Adam Spannbauer's approach you can also force Plotly to interpret the data as numbers by setting the yaxis type to linear
layout(yaxis=list(type='linear'))
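As @neilfws mentioned in a comment, the issue is that your y-axis data is stored as factors. You can fix this when reading the data (as @neilfws suggested) or coerce the columns to numeric before plotting. Note that calling as.numeric directly on a factor returns its internal level codes rather than the displayed values, so convert to character first. Below is how you can do the latter.
library(plotly)
chart.supp.part.defect.matrix[,2:10] <- lapply(chart.supp.part.defect.matrix[,2:10], function(x) as.numeric(as.character(x)))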
p3 <- plot_ly(chart.supp.part.defect.matrix, x = ~Supplier, y = ~PaintMarking45, type = 'bar', name = 'Paint/Marking-45') %>%
add_trace(y = ~Seal78,name = 'Seal-78') %>%
add_trace(y = ~AirConditioning57,name = 'Air conditioning - 57') %>%
add_trace(y = ~Engine34,name = 'Engine-34') %>%
add_trace(y = ~CargoCompartment543,name = 'Cargo compartment-543') %>%
add_trace(y = ~Insulation11 ,name = 'Insulation -11') %>%
add_trace(y = ~Insulation6,name = 'Insulation-6') %>%
add_trace(y = ~Engine11,name = 'Engine-11') %>%
add_trace(y = ~Propulsion32,name = 'Propulsion-32') %>%
layout(yaxis = list(title = 'Defect Count'), barmode = 'group') %>%
layout(title = title)
p3
Additionally, you don't need to call ggplotly in this case. That function is only needed when you want to build your plot using ggplot2 and then add plotly's interactivity to the ggplot object.
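Converting factors to numbers is a common stumbling block, so here is a small base R illustration with values taken from the PaintMarking45 column. The variable names are made up for the example.

```r
# A factor whose labels look like numbers
f <- factor(c("8", "3", "65", "4"))

# as.numeric() on a factor returns the internal level codes,
# i.e. the position of each value among the sorted labels
codes <- as.numeric(f)

# Going through character first recovers the actual numbers
values <- as.numeric(as.character(f))

print(codes)
print(values)
```

Only the second form gives values that are meaningful on a numeric y-axis.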
So, my challenge has been to convert a raw scale csv to a scored csv. Within numerous columns, the file has cells filled with "Strongly Agree" to "Strongly Disagree", 6 levels. These factors need to be converted into integers 5 to 0 respectively.
I have tried unsuccessfully to use sapply and convert the table to a string. Sapply works on the vector, but it destroys the table structure.
Method 1:
dat$Col<-sapply(dat$Col,switch,'Strongly Disagree'=0,'Disagree'=1,'Slightly Disagree'=2,'Slightly Agree'=3,'Agree'=4, 'Strongly Agree'=5)
My second approach is to convert the csv into a string. When I examined the dput output, I saw the area I wanted to target that started with a .Label="","Strongly Agree"... Mistake. My changes did not result in a useful outcome.
My third approach came from the internet gods of destruction who seemed to express that gsub() might handle the string approach as well. Nope, again the underlying table structure was destroyed.
Method #3: Convert into a string and pattern match
dat <- textConnection("control/Surveys/StudyDat_1.csv")
#Score Scales
##"Strongly Agree"= 5
##"Agree"= 4
##"Strongly Disagree" = 0
#levels(dat$Col) <- gsub("Strongly Agree", "5", levels(dat$Col))
df<- gsub("Strongly Agree", "5",dat)
dat<-read.csv(textConnection(df),header=TRUE)
In the end, I am wanting to replace ALL "Strongly Agree" to 5 across numerous columns without the consequence of destroying the retrievability of the data.
Maybe I used the wrong search string and you know the resource I need to address this problem. I would rather avoid ALL character vector approaches, as this would require labeling each column if you provide a code response. It will need to go across ALL COLUMNS.
Thanks
Data Sample Problem
structure(list(last_updated = structure(c(3L, 1L, 7L, 2L, 10L, 6L, 8L, 9L, 7L, 5L, 4L), .Label = c("2016-05-13T12:53:56.704184Z",
"2016-05-13T12:54:09.273359Z", "2016-05-13T12:54:22.757251Z",
"2016-05-14T12:44:13.474992Z", "2016-05-14T12:44:31.736469Z",
"2016-05-16T16:45:10.623410Z", "2016-05-16T16:46:17.881402Z",
"2016-05-16T16:46:55.122257Z", "2016-05-16T16:47:14.160793Z",
"2016-05-24T02:26:04.770799Z"), class = "factor"), feedback = c(NA,
NA, NA, NA, NA, NA, NA, NA, NA, NA, NA), A = structure(c(NA,
NA, 2L, NA, 1L, NA, NA, NA, 2L, NA, NA), .Label = c("", "Slightly Disagree"
), class = "factor"), B = structure(c(NA, NA, 2L, NA, 1L, NA,
NA, NA, 3L, NA, NA), .Label = c("", "Disagree", "Strongly Agree"
), class = "factor"), C = structure(c(NA, NA, 2L, NA, 1L, NA,
NA, NA, 3L, NA, NA), .Label = c("", "Agree", "Disagree"), class = "factor"),
D = structure(c(NA, NA, 2L, NA, 1L, NA, NA, NA, 2L, NA, NA
), .Label = c("", "Agree"), class = "factor"), E = structure(c(NA,
NA, 2L, NA, 1L, NA, NA, NA, 3L, NA, NA), .Label = c("", "Agree",
"Strongly Disagree"), class = "factor")), .Names = c("last_updated",
"feedback", "A", "B", "C", "D", "E"), class = "data.frame", row.names = c(NA,
-11L))
Data Sample Solution
df <- structure(list(last_updated = structure(c(3L, 1L, 7L, 2L, 10L, 6L, 8L, 9L, 7L, 5L, 4L), .Label = c("2016-05-13T12:53:56.704184Z",
"2016-05-13T12:54:09.273359Z", "2016-05-13T12:54:22.757251Z",
"2016-05-14T12:44:13.474992Z", "2016-05-14T12:44:31.736469Z",
"2016-05-16T16:45:10.623410Z", "2016-05-16T16:46:17.881402Z",
"2016-05-16T16:46:55.122257Z", "2016-05-16T16:47:14.160793Z",
"2016-05-24T02:26:04.770799Z"), class = "factor"), feedback = c(NA,
NA, NA, NA, NA, NA, NA, NA, NA, NA, NA), A = c(NA, NA, 2L, NA,
NA, NA, NA, NA, 2L, NA, NA), B = c(NA, NA, 1L, NA, NA, NA, NA,
NA, 5L, NA, NA), C = c(NA, NA, 4L, NA, NA, NA, NA, NA, 1L, NA,
NA), D = c(NA, NA, 4L, NA, NA, NA, NA, NA, 4L, NA, NA), E = c(NA,
NA, 4L, NA, NA, NA, NA, NA, 0L, NA, NA)), .Names = c("last_updated",
"feedback", "A", "B", "C", "D", "E"), class = "data.frame", row.names = c(NA, -11L))
We can use factor with the levels specified:
nm1 <- c('Strongly Disagree', 'Disagree',
'Slightly Disagree','Slightly Agree','Agree', 'Strongly Agree')
factor(dat$col, levels = nm1,
labels = 0:5)
If there are multiple factor columns with the same levels, identify the factor columns ('i1'), loop through it with lapply and specify the levels and labels.
i1 <- sapply(dat, is.factor)
dat[i1] <- lapply(dat[i1], factor, levels = nm1, labels= 0:5)
Update
Using the OP's dput output
dat[-(1:2)] <- lapply(dat[-(1:2)], factor, levels = nm1, labels = 0:5)
dat
# last_updated feedback A B C D E
#1 2016-05-13T12:54:22.757251Z NA <NA> <NA> <NA> <NA> <NA>
#2 2016-05-13T12:53:56.704184Z NA <NA> <NA> <NA> <NA> <NA>
#3 2016-05-16T16:46:17.881402Z NA 2 1 4 4 4
#4 2016-05-13T12:54:09.273359Z NA <NA> <NA> <NA> <NA> <NA>
#5 2016-05-24T02:26:04.770799Z NA <NA> <NA> <NA> <NA> <NA>
#6 2016-05-16T16:45:10.623410Z NA <NA> <NA> <NA> <NA> <NA>
#7 2016-05-16T16:46:55.122257Z NA <NA> <NA> <NA> <NA> <NA>
#8 2016-05-16T16:47:14.160793Z NA <NA> <NA> <NA> <NA> <NA>
#9 2016-05-16T16:46:17.881402Z NA 2 5 1 4 0
#10 2016-05-14T12:44:31.736469Z NA <NA> <NA> <NA> <NA> <NA>
#11 2016-05-14T12:44:13.474992Z NA <NA> <NA> <NA> <NA> <NA>
Another option is set from data.table
library(data.table)
for(j in names(dat)[-(1:2)]){
set(dat, i = NULL, j= j, value = factor(dat[[j]], levels = nm1, labels = 0:5))
}
I would just match each target column vector against a precomputed character vector to get an integer index. You can subtract 1 afterward to change the range from 1:6 to 0:5.
## define desired value order, ascending
o <- c(
'Strongly Disagree',
'Disagree',
'Slightly Disagree',
'Slightly Agree',
'Agree',
'Strongly Agree'
);
## convert target columns
for (cn in names(df)[-(1:2)]) df[[cn]] <- match(as.character(df[[cn]]),o)-1L;
df;
## last_updated feedback A B C D E
## 1 2016-05-13T12:54:22.757251Z NA NA NA NA NA NA
## 2 2016-05-13T12:53:56.704184Z NA NA NA NA NA NA
## 3 2016-05-16T16:46:17.881402Z NA 2 1 4 4 4
## 4 2016-05-13T12:54:09.273359Z NA NA NA NA NA NA
## 5 2016-05-24T02:26:04.770799Z NA NA NA NA NA NA
## 6 2016-05-16T16:45:10.623410Z NA NA NA NA NA NA
## 7 2016-05-16T16:46:55.122257Z NA NA NA NA NA NA
## 8 2016-05-16T16:47:14.160793Z NA NA NA NA NA NA
## 9 2016-05-16T16:46:17.881402Z NA 2 5 1 4 0
## 10 2016-05-14T12:44:31.736469Z NA NA NA NA NA NA
## 11 2016-05-14T12:44:13.474992Z NA NA NA NA NA NA
Previous answers might meet your needs, but note that changing the labels of a factor isn't the same as changing a factor to an integer variable. One possibility would be to use ifelse (I've made a new data frame as the one you posted didn't actually have variables with these levels in it):
lev <- c('Strongly disagree', 'Disagree', 'Slightly disagree', 'Slightly agree', 'Agree', 'Strongly agree')
dta <- sample(lev, 55, replace = TRUE)
# stringsAsFactors = TRUE is needed in R >= 4.0.0, where strings are no longer converted to factors by default
dta <- data.frame(matrix(dta, nrow = 11), stringsAsFactors = TRUE)
names(dta) <- LETTERS[1:5]
f_to_int <- function(f) {
  if (is.factor(f)) {
    ifelse(f == 'Strongly disagree', 0,
      ifelse(f == 'Disagree', 1,
        ifelse(f == 'Slightly disagree', 2,
          ifelse(f == 'Slightly agree', 3,
            ifelse(f == 'Agree', 4,
              ifelse(f == 'Strongly agree', 5, f))))))
  } else f
}
dta2 <- sapply(dta, f_to_int)
Note that this returns a matrix, but it is easily converted to a data frame if necessary.
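A named lookup vector is one possible alternative to the nested ifelse calls. The sketch below is only an illustration: the vector scores and the tiny data frame are made up, and it assumes the same level spellings as in the code above. Assigning into dta2[] keeps the data frame structure, so no conversion back from a matrix is needed.

```r
# Score for each response, keyed by the response text
scores <- c(
  "Strongly disagree" = 0L,
  "Disagree" = 1L,
  "Slightly disagree" = 2L,
  "Slightly agree" = 3L,
  "Agree" = 4L,
  "Strongly agree" = 5L
)

# Tiny example data frame with factor columns (made up for illustration)
dta <- data.frame(
  A = factor(c("Agree", "Disagree", NA)),
  B = factor(c("Strongly agree", "Slightly disagree", "Agree"))
)

# Look up each response by name; missing responses stay NA
dta2 <- dta
dta2[] <- lapply(dta, function(col) unname(scores[as.character(col)]))
print(dta2)
```

Responses that are not in the lookup vector also become NA, so a misspelled level shows up as a missing score rather than a wrong one.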