I have a column with lists of variables, separated by commas, plus sometimes values for the variables set by "=".
See picture.
I want the variables as columns, holding TRUE/FALSE or 1/0 values, plus an extra column for the value wherever one is set by "=".
I guess it's a similar question to Pandas convert a column of list to dummies but I need it in R.
Since you haven't provided explicit data, I had to recreate it from your screenshots (please provide at least textual data next time; it makes it easier to reproduce your task).
The code below is explained with comments and uses tidyverse functions from the packages loaded at the top. The result is what you asked for, except that the columns eventnumber_value are named value_eventnumber, since starting a variable or column name with a digit is not good practice.
I don't know what you need the data for, but in my experience the wide format is less useful than the long format in most cases. Especially here, since I expect that one event may happen only for one ID. Thus, dat_pivoted is more convenient to operate on.
library(tibble)
library(tidyr)
library(dplyr)
library(stringr)
dat <- tribble(
  ~post_event_list, ~date_time,
  "239=20.00,200,20149,100,101,102,103,104,105,106,107,108,114,198", "2022-03-01 00:23:50",
  "257,159", "2022-03-01 00:02:51",
  "201,109,110,111,112", "2022-03-01 00:57:23"
)
dat_pivoted <- dat %>%
  mutate(post_event_list = str_split(post_event_list, ",")) %>% # transform comma-separated strings into character vectors
  unnest_longer(post_event_list) %>% # put each vector element on its own row
  separate(post_event_list, sep = "=", into = c("var", "val"), fill = "right") %>% # separate variables from values (case of 'X=Y'), put NA as value if there is no value
  mutate(val = as.numeric(val)) # treat 'val' column as numeric
dat_values <- dat_pivoted %>%
  pivot_wider(id_cols = date_time, names_from = var, names_prefix = "value_", values_from = val) %>% # turn data into wide format -- make a column per each event value, present or not
  select(!where(~ all(is.na(.x)))) # keep only those value columns where not every element is NA
dat_indicator <- dat_pivoted %>%
  mutate(val = TRUE) %>% # each row indicates the presence of an event -- change all values to TRUE
  pivot_wider(id_cols = date_time, names_from = var, values_from = val, values_fill = FALSE) # pivot again, replacing the resulting NAs with FALSE
dat_transformed <- left_join(dat_indicator, dat_values, by = "date_time") # combine indicator and value columns
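To illustrate why the long format can be easier to work with, here is a minimal sketch. The two-row sample is hypothetical (it only mimics the shape of the data above), and separate_rows() is used as a shortcut for the str_split() plus unnest_longer() steps.

```r
library(tibble)
library(tidyr)
library(dplyr)

# Hypothetical two-row sample in the same shape as the question's data
dat <- tribble(
  ~post_event_list, ~date_time,
  "239=20.00,200",  "2022-03-01 00:23:50",
  "257,200",        "2022-03-01 00:02:51"
)

# Long format: one row per (date_time, event) pair
dat_long <- dat %>%
  separate_rows(post_event_list, sep = ",") %>%
  separate(post_event_list, into = c("var", "val"), sep = "=", fill = "right") %>%
  mutate(val = as.numeric(val))

# How often does each event occur?
event_counts <- count(dat_long, var)

# Which rows contain event 200?
rows_with_200 <- filter(dat_long, var == "200")
```

Both questions are answered with one verb each, without having to know in advance which event columns exist.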
I am having trouble combining multiple rows into 1 row, below is my current data:
I want one row of symptoms for each VAERS_ID. However, because the number of rows for each VAERS_ID is inconsistent, I am having trouble.
I have tried this:
test = data %>%
  select(VAERS_ID, SYMPTOM1, SYMPTOM2, SYMPTOM3, SYPMTOM4, SYMPTOM5) %>%
  group_by(VAERS_ID) %>%
  mutate(Grp = paste0(SYMPTOM1,SYMPTOM1, SYMPTOM2, SYMPTOM3, SYPMTOM4, SYMPTOM5, collapse = ",")) %>%
  distinct(VAERS_ID, Grp, .keep_all = TRUE)
This gives me the original data, plus another column labeled Grp containing all of the symptoms for each VAERS_ID pasted together, with a comma between each set.
Any help would be appreciated.
Your approach seems right, but since the data cannot be copied and tested, I am not able to reproduce your error. Here are some changes you can try.
You want all symptoms in one place for each VAERS_ID, which is a common real-world use case that I face often. If you don't need the original data in the output, simply use this:
data %>%
  group_by(VAERS_ID) %>%
  summarise(Symptoms = paste0(SYMPTOM1, SYMPTOM2, SYMPTOM3, SYPMTOM4, SYMPTOM5, collapse = ","))
With mutate you get the original data back, since it only adds a new column.
To address the warning about ungrouping, add %>% ungroup() at the end, or add .groups = "drop" within summarise.
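A minimal sketch of the summarise approach on made-up data follows. The sample values and the restriction to two symptom columns are assumptions for illustration. Unlike the code above, it stacks the columns with c() and drops NA entries, so every symptom ends up separated by a comma.

```r
library(dplyr)

# Hypothetical sample: the number of rows per VAERS_ID varies
data <- data.frame(
  VAERS_ID = c(1, 1, 2),
  SYMPTOM1 = c("Fever", "Nausea", "Rash"),
  SYMPTOM2 = c("Chills", NA, "Itching"),
  stringsAsFactors = FALSE
)

symptoms <- data %>%
  group_by(VAERS_ID) %>%
  summarise(
    # c() stacks the symptom columns; na.omit() drops empty slots
    Symptoms = paste(na.omit(c(SYMPTOM1, SYMPTOM2)), collapse = ","),
    .groups = "drop"
  )
```

Note that the symptoms are listed column by column within each VAERS_ID, not row by row.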
I have a data.frame that looks like this:
UID<-c(rep(1:25, 2), rep(26:50, 2))
Group<-c(rep(5, 25), rep(20, 25), rep(-18, 25), rep(-80, 25))
Value<-sample(100:5000, 100, replace=TRUE)
df<-data.frame(UID, Group, Value)
But I need the values separated into new rows so I run this:
df<-pivot_wider(df, names_from = Group,
values_from = Value,
values_fill = list(Value = 0))
Which introduces NULL into the dataset. Sorry, could not figure out a way to get an example dataset with NULL values. Note: this is now a tbl_df tbl data.frame
These aren't great variable names so I run this:
colnames(df)[which(names(df) == "20")] <- "pos20"
colnames(df)[which(names(df) == "5")] <- "pos5"
colnames(df)[which(names(df) == "-18")] <- "neg18"
colnames(df)[which(names(df) == "-80")] <- "neg80"
What I want to be able to do is create a new column (variable) that rowSums across columns. So I run this:
df<-df%>%
replace(is.na(.), 0) %>%
mutate(rowTot = rowSums(.[2:5]))
Which of course works on the example dataset but not on the one with NULL values. I have tried converting NULL to NA using df[df== "NULL"] <- NA but the values do not change. I have tried converting the lists to numeric using as.numeric(as.character(unlist(df[[2]]))) but I get an error telling me I have unequal number of rows, which I guess would be expected.
I realize there might be a better process to get my desired end result, so any suggestions to any of this is most appreciated.
EDIT: Here is a link to the actual dataset which will introduce Null values after using pivot_wider. https://drive.google.com/file/d/1YGh-Vjmpmpo8_sFAtGedxzfCiTpYnKZ3/view?usp=sharing
It is difficult to answer with confidence without a reproducible example where the error occurs, but I am going to take a guess.
I think your pivot_wider step produces list-columns (meaning some cells hold vectors), and that is why you are getting NULL values. Create a unique row identifier within each group and then use pivot_wider. Also, rowSums has an na.rm parameter, so you don't need replace.
library(dplyr)
library(tidyr)
df %>%
  group_by(temp) %>%
  mutate(row = row_number()) %>% # unique row id within each group
  ungroup() %>%
  pivot_wider(names_from = temp, values_from = numseeds) %>%
  mutate(rowTot = rowSums(.[3:6], na.rm = TRUE))
Please change the column numbers according to your data in rowSums if needed.
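The list-column behaviour can be reproduced on a small made-up data frame. The sketch below is only an illustration under stated assumptions: the data are hypothetical, and summing duplicates with values_fn is one possible choice that may not fit your actual dataset.

```r
library(dplyr)
library(tidyr)

# Hypothetical data: UID 1 has two values for group "5"
df <- data.frame(
  UID   = c(1, 1, 1, 2),
  Group = c("5", "5", "20", "5"),
  Value = c(10, 30, 7, 4)
)

# Without a unique key per cell, pivot_wider() warns and returns list-columns
wide_list <- suppressWarnings(
  pivot_wider(df, names_from = Group, values_from = Value)
)
has_list_cols <- is.list(wide_list$`5`)

# One option: aggregate the duplicates while pivoting
wide_sum <- df %>%
  pivot_wider(
    names_from = Group, values_from = Value,
    names_prefix = "grp", values_fn = sum, values_fill = 0
  ) %>%
  mutate(rowTot = rowSums(across(starts_with("grp"))))
```

Selecting the columns by prefix instead of by position also avoids having to adjust the column numbers by hand.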
I have a long pipe of different filtering and selecting functions, and in that same pipe operation, I would like to rename a column based on the value in the first row of another column. I have to do this for many different data frames, so a pipeline that is agnostic to the name of the data frame would be nice.
This is a small example:
original <- tibble(value = c(1,2,4,6,7), month = 1:5, year = 2018)
what_I_want <- tibble(indicator2018 = c(1,2,4,6,7), month = 1:5, year = 2018)
So if the first row of the column year would have been 2015, then the column name of value would have changed to indicator2015.
This doesn't work:
original %>%
rename(paste0("indicator", .$year[1]) = "value")
original %>%
rename_at(vars(starts_with("value")), list( ~ str_replace(., "value", paste0("indicator", .["year"][1]))))
This works but involves breaking the pipe and (more importantly) requires the name of the data frame in the pipe, so would not scale to many different data frames without manually changing the code.
original2 <- original %>%
rename_at(vars(starts_with("value")), list( ~ str_replace(., "value", paste0("indicator", original$year[1]))))
You need to do some unquoting. This works:
original %>%
rename(!!paste0("indicator", .$year[1]) := "value")
For future reference, I would suggest that you check out the "programming with dplyr" vignette (https://cran.r-project.org/web/packages/dplyr/vignettes/programming.html).
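Since the same rename has to be applied to many data frames, the unquoting step could also be wrapped in a small function. This is a sketch only; rename_by_year is a hypothetical helper name, and it assumes every data frame has columns called value and year.

```r
library(dplyr)

# Hypothetical helper: renames `value` using the first entry of `year`
rename_by_year <- function(df) {
  new_name <- paste0("indicator", df$year[1])
  rename(df, !!new_name := "value")
}

original <- data.frame(value = c(1, 2, 4, 6, 7), month = 1:5, year = 2018)

# The helper slots into a longer pipe without naming the data frame
renamed <- original %>%
  filter(month > 1) %>%
  rename_by_year()
```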
I would like to combine my values with column names (see current set and required set):
Current set =
- ncol = 9
- nrow = 26814
I want to add the values from SheetNaam to the columns XYEAR to expand my columns and decrease my rows, without losing data or 'NA'. Is this possible in R?
Difficult to explain by text, hope someone will understand my explanation.
We can try with gather and spread: gather the columns whose names start with 'X' followed by digits, unite 'SheetNaam' and 'key' into a single column, and spread back to 'wide' format.
library(tidyverse)
gather(df1, key, val, matches("^X\\d+$"), na.rm = TRUE) %>%
  unite(SheetNaam, SheetNaam, key, sep = "") %>%
  group_by(SheetNaam) %>%
  mutate(rn = row_number()) %>%
  spread(SheetNaam, val)
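In current tidyr, gather and spread are superseded by pivot_longer and pivot_wider. A rough equivalent of the same idea is sketched below. The data frame is a hypothetical stand-in for the screenshots (the ID column and all values are assumptions), and here the ID column identifies the output rows, so no row number is added.

```r
library(dplyr)
library(tidyr)

# Hypothetical stand-in for the data shown in the question
df1 <- data.frame(
  ID        = c(1, 1, 2, 2),
  SheetNaam = c("A", "B", "A", "B"),
  X2015     = c(10, 20, 30, NA),
  X2016     = c(11, 21, 31, 41)
)

wide <- df1 %>%
  # stack the X<year> columns, dropping missing values
  pivot_longer(matches("^X\\d+$"), names_to = "key", values_to = "val",
               values_drop_na = TRUE) %>%
  # build the new column names, e.g. "A" + "X2015" -> "AX2015"
  unite(SheetNaam, SheetNaam, key, sep = "") %>%
  # one column per SheetNaam/year combination
  pivot_wider(names_from = SheetNaam, values_from = val)
```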
I have a dataframe containing observations for two sets of data (A,B), with dataset and observation type given by the column names :
mydf <- data.frame(meta1=paste0("a",1:2), meta2=paste0("b",1:2),
A_var1 = c(11:12), A_var2 = c("p","r"),
B_var1 = c(21:22), B_var2 = c("x","z"))
I would like to reshape this dataframe so that each row contains observations on one set only. In this long format, set and column names should be given by splitting the original column names at the '_':
mydf2 <- data.frame(meta1=rep(paste0("a",1:2),2),
                    meta2=rep(paste0("b",1:2),2),
                    set=c("A","A","B","B"),
                    var1 = c(11:12, 21:22),
                    var2 = c("p","r","x","z"))
I have tried using 'gather' in combination with 'str_split' and 'sub', but unfortunately without success. Could this be done using tidyverse functions?
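Yes, you can do this with the tidyverse!
You were close: you need to gather, then separate, then spread.
library(tidyr)
new_df <- mydf %>%
  gather(set, vars, 3:6) %>% # stack the four data columns (values are coerced to character)
  separate(set, into = c('set', 'var'), sep = "_") %>% # split "A_var1" into "A" and "var1"
  spread(var, vars, convert = TRUE) # back to wide; convert = TRUE restores column types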
hope this helps!
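With a recent tidyr, the same reshape can also be sketched in a single pivot_longer call. The special ".value" entry in names_to tells tidyr that the part of the column name after the '_' names the output column, so var1 stays integer and var2 stays character.

```r
library(tidyr)

mydf <- data.frame(meta1 = paste0("a", 1:2), meta2 = paste0("b", 1:2),
                   A_var1 = 11:12, A_var2 = c("p", "r"),
                   B_var1 = 21:22, B_var2 = c("x", "z"),
                   stringsAsFactors = FALSE)

long_df <- pivot_longer(
  mydf,
  cols = -c(meta1, meta2),          # reshape everything except the meta columns
  names_to = c("set", ".value"),    # "A_var1" -> set = "A", value goes to column var1
  names_sep = "_"
)
```

The rows come out ordered by original row and then by set, so sort with dplyr::arrange(long_df, set) if you need all of set A first.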