I would like to split a list column element into individual columns.
For example, in the starwars dataset,
data("starwars")
I would want this list column (the entry in row 7)
c("Attack of the Clones", "Revenge of the Sith", "A New Hope")
To be broken into columns A,B,C... with the values of the movies
A                      B                     C            D   ...
Attack of the Clones   Revenge of the Sith   A New Hope   NA  ...
I have kind of hacked together a way to do this with
starwars %>% separate(films, into = letters[1:7], sep = ",")
Which would result in an output of
A B C D ...
c("Attack of the Clones" "Revenge of the Sith" "A New Hope") NA ...
But this will require some additional scrubbing, and I don't think this is general. Is there a way to do this in one swoop?
The 'films' column is a list of vectors. If we want to create a data.frame with 7 columns (i.e. the maximum length of the 'films' elements), we can pad each vector to that maximum length while it is still stored as a list, convert each element to a data.frame, and then unnest:
library(tidyverse)
mx <- max(lengths(starwars$films))
starwars %>%
  mutate(films = map(films, ~ `length<-`(.x, mx) %>%
                       as.data.frame.list %>%
                       set_names(LETTERS[seq_len(mx)]))) %>%
  unnest(films)
Or another option is to pull the 'films' column, convert each element to a tibble within map, and bind the result with the columns of 'starwars' except 'films':
starwars %>%
  pull(films) %>%
  map_df(~ t(.x) %>%
           as_tibble) %>%
  bind_cols(starwars %>%
              select(-films), .)
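If a recent tidyr (>= 1.0.0) is available, unnest_wider() can do the padding and spreading in one step; a minimal sketch (the names_sep argument and the rename to A, B, C, ... mirror the question):
library(tidyverse)
starwars %>%
  unnest_wider(films, names_sep = "_") %>% # spreads each vector into columns, padding short ones with NA
  rename_with(~ LETTERS[seq_along(.x)], starts_with("films_"))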
I have two data frames with spectral bands from a satellite, redDF and nirDF. Both data frames have values per date column starting with an 'X', and these column names correspond across both data frames.
I want to get a new data frame where for each column starting with an 'X' in both redDF and nirDF a new value is calculated according to some formula.
Here is a data sample:
library(dplyr)
set.seed(999)
# get column names
datecolnames <- seq(as.Date("2015-05-01", "%Y-%m-%d"),
                    as.Date("2015-09-20", "%Y-%m-%d"),
                    by = "16 days") %>%
  format(., "%Y-%m-%d") %>%
  paste0("X", .)
# sample data values
mydata <- as.integer(runif(length(datecolnames)) * 1000)
# sample no-data indices
nodata <- sample(1:length(datecolnames), length(datecolnames) * 0.3)
mydata[nodata] <- NA # assign no data to the correct indices
# get dummy data.frame of red spectral values
redDF <- data.frame(mydata,
                    mydata[sample(1:length(mydata))],
                    mydata[sample(1:length(mydata))]) %>%
  t() %>%
  as.data.frame(., row.names = FALSE) %>%
  rename_with(~datecolnames) %>%
  mutate(id = row_number() + 1142) %>%
  select(id, everything())
# get dummy data.frame of near infrared spectral values
# in this case a modified version of redDF
nirDF <- redDF %>%
  mutate(across(-id, ~as.integer(.x + 20 * 1.8))) %>%
  select(id, everything())
> nirDF
id X2015-05-01 X2015-05-17 X2015-06-02 X2015-06-18 X2015-07-04 X2015-07-20 X2015-08-05
1 1143 NA 645 NA 636 569 841 706
2 1144 1025 NA 706 569 354 NA NA
3 1145 904 636 706 645 NA NA 115
X2015-08-21 X2015-09-06 X2015-09-22 X2015-10-08 X2015-10-24 X2015-11-09
1 115 1025 904 NA 409 354
2 115 636 409 645 841 904
3 569 409 354 841 1025 NA
and this is the formula:
getNDVI <- function(red, nir){round((nir - red)/(nir + red), digits = 4)}
I hoped I would be able to do something like:
ndviDF <- redDF %>% mutate(across(starts_with('X'), .fns = getNDVI))
But that doesn't work, as dplyr doesn't know what the nir argument of getNDVI should be. I have seen solutions for accessing other data frames in mutate() by using the $COLNAME indexer, but since I have 197 columns, that is not an option here.
I would approach this with a for loop, though I know it does not make best use of functionality like across.
First we create a list of the columns we want to iterate over:
cols_to_iterate_over = redDF %>%
  select(starts_with("X")) %>%
  colnames()
Then we join on id and ensure columns are named according to source dataset:
joined_df = redDF %>%
  inner_join(nirDF, by = "id", suffix = c("_red", "_nir"))
So joined_df should have columns like:
id X2015-05-01_red X2015-05-01_nir X2015-05-17_red X2015-05-17_nir ...
Then we can loop over these:
for(col in cols_to_iterate_over){
  # columns for calculation
  red_col = paste0(col, "_red") %>% sym()
  nir_col = paste0(col, "_nir") %>% sym()
  out_col = col %>% sym()
  # calculate
  joined_df = joined_df %>%
    mutate(
      !!out_col := round((!!nir_col - !!red_col) / (!!nir_col + !!red_col),
                         digits = 4)
    ) %>%
    select(-!!red_col, -!!nir_col)
}
Explanation: We can use text strings as variable names if we turn them into symbols and then !! them.
sym() turns text into symbols,
!! inside dplyr commands turns symbols into code,
and := is equivalent to = but permits us to have !! on the left-hand side.
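A minimal self-contained illustration of this pattern on mtcars (the column name mpg_doubled is purely for demonstration):
library(dplyr)
library(rlang) # sym() lives here (it is also re-exported by dplyr)
col_in  <- sym("mpg")          # text -> symbol
col_out <- sym("mpg_doubled")  # hypothetical output column name
mtcars %>%
  mutate(!!col_out := !!col_in * 2) %>% # !! splices the symbols back into the call
  head(2)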
Sorry, this is slightly old syntax. For the current approaches see programming with dplyr.
In its most basic form, you can just do this:
round((nirDF - redDF)/(nirDF + redDF), digits = 4)
But this does not retain the id-column and can break if some columns are not numeric. A more failsafe version would be:
red <- redDF %>%
  arrange(id) %>%                           # be sure to apply the same order everywhere
  select(starts_with('X')) %>%
  mutate(across(everything(), as.numeric))  # be sure to have numeric columns
nir <- nirDF %>%
  arrange(id) %>%
  select(starts_with('X')) %>%
  mutate(across(everything(), as.numeric))
# make sure that the numbers of rows are equal
if(nrow(red) == nrow(nir)){
  # get data.frame with ndvi values
  ndvi <- round((nir - red)/(nir + red), digits = 4) %>%
    # bind id-column and possibly other columns to the data frame
    bind_cols(redDF %>% arrange(id) %>% select(!starts_with('X'))) %>%
    # place the id-column to the front
    select(!starts_with('X'), everything())
}
As far as I have understood dplyr by now, it boils down to this:
across is (generally) meant for many-to-many relationships, but handles columns on an individual basis by default. So, if you give it three columns, it will give you three columns back which are not aware of the values in other columns.
c_across on the other hand, can evaluate relationships between columns (like a sum or a standard deviation) but is meant for many-to-one relationships. In other words, if you give it three columns, it will give you one column back.
Neither of these is suitable for this task. However, by design, arithmetic operations can be applied to data frames in R (just try cars*cars for instance). This is what we need in this case. Luckily, these operations are not as greedy as dplyr join operations, so they can be done efficiently on large data frames.
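To make the contrast concrete, here is a toy sketch (the names toy, a, b, and s are purely illustrative):
library(dplyr)
toy <- tibble(a = c(1, 2), b = c(3, 4))
# across(): column-wise; two columns in, two columns out
toy %>% mutate(across(a:b, ~ .x * 2))
# c_across(): row-wise; two columns in, one column out
toy %>% rowwise() %>% mutate(s = sum(c_across(a:b))) %>% ungroup()
# element-wise arithmetic on whole data frames, as used above
head(cars * cars, 2)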
While doing so, you need to keep some requirements in mind:
The number of rows of the two data frames must be equal; otherwise, the shorter data frame will get recycled.
All columns in the data frames need to be of a numeric class (numeric or integer).
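As an aside, newer dplyr (>= 1.0.0) also offers cur_column(), which makes the across() attempt from the question workable; a minimal sketch, assuming redDF and nirDF contain the same ids in the same row order:
library(dplyr)
getNDVI <- function(red, nir){round((nir - red)/(nir + red), digits = 4)}
# cur_column() returns the name of the column across() is currently working on,
# so the matching nir column can be looked up by name
ndviDF <- redDF %>%
  mutate(across(starts_with("X"),
                ~ getNDVI(red = .x, nir = nirDF[[cur_column()]])))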
In a data frame, I have a column (type: chr) that contains answers separated by a comma. I want to create another column based on the size of the string and award points. For example, some of the entries in a column are:
Column1
word1,word2,word3
word1,word2
word1
Now, for the first cell, I want the size of the cell to be evaluated as 3 (as it contains three distinct words and there are no duplicates among the cell values). I'm not sure how to achieve this.
An option is to split the column with strsplit into a list of vectors, get the unique elements by looping over the list with lapply, and get the lengths:
df1$Size <- lengths(lapply(strsplit(df1$Column1, ",\\s*"), unique))
Another option is separate_rows from tidyr
library(dplyr)
library(tidyr)
df1 %>%
  mutate(rn = row_number()) %>%
  separate_rows(Column1) %>%
  group_by(rn) %>%
  summarise(Size = n_distinct(Column1), .groups = 'drop') %>%
  select(Size) %>%
  bind_cols(df1, .)
Output:
# Column1 Size
#1 word1,word2,word3 3
#2 word1,word2 2
#3 word1 1
Data:
df1 <- data.frame(Column1 = c('word1,word2,word3', 'word1,word2', 'word1'))
Original Answer:
Another option:
library(dplyr)
library(stringr)
df %>%
  mutate(Lengths = str_count(Column1, ",") + 1)
Edit:
I hadn't noticed the OP's requirement properly (about non-duplicates). As @Onyambu pointed out in the comments, this chunk only works if there are no duplicated words in the data.
It basically counts how many words there are.
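For a duplicate-aware count in the same stringr style, one option (a sketch combining str_split() with n_distinct()) is:
library(dplyr)
library(purrr)
library(stringr)
df %>%
  mutate(Size = map_int(str_split(Column1, ",\\s*"), n_distinct))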
I have 5 data frames and I have to analyze just the first column. From these, I must obtain a frequency table of their common words (not necessarily common to all data frames; for example, a word may appear in just two or more of them).
Then I must obtain a frequency table of the common words of ALL data frames.
I tried a for loop, but it seems very complicated. Moreover, the data frames have different dimensions. I didn't find any useful function.
Then I tried doing
lst1 <- list(a,b,c,d,e)
newdat <- stack(setNames(lapply(lst1, "[", 1), seq_along(lst1)))[2:1]
library(dplyr)
newdat %>% group_by(val) %>% filter(uniqueN(ind) > 1) %>% count(val)
but it gives me an error
> stack(setNames(lapply(lst1, "[", 1), seq_along(lst1)))
Error in stack.default(setNames(lapply(lst1, "[", 1), seq_along(lst1))):
at least one vector element is required
Thank you
Here's my solution using purrr & dplyr:
library(purrr)
library(dplyr)
lst1 <- list(mtcars = mtcars, iris = iris, chick = chickwts, cars = cars, airqual = airquality)
lst1 %>%
  map_dfr(select, value = 1, .id = "df") %>% # select first column of every data frame and name it "value"
  group_by(value) %>%
  summarise(freq = n(),                      # frequency over all data frames
            n_df = n_distinct(df),           # number of data frames the value occurs in
            dfs = paste(unique(df), collapse = ",")) %>%
  filter(n_df > 1) %>%
  filter(n_df == 5) # keep this line only if the value has to be in all 5 data frames
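For just the "common to ALL data frames" part, a base R sketch with Reduce() returns the common values themselves (without the frequencies):
# values that occur in the first column of every data frame in lst1
Reduce(intersect, lapply(lst1, function(x) unique(x[[1]])))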
I have data with User_Name and Group.
User_Name   Group
MustafE     A
fischeta    A
LosperS1    A
MustafE     B
fischeta    B
jose        B
MustafE     c
fischeta    c
I want to flag those customers that do not repeat across groups. For example, 'LosperS1' is in group A but not in group B; in the same way, 'jose' is in group B but not in group C. So in a new column they will be marked as "Not in group B"/"Not in group C".
Any help will be appreciated.
Here is a way to get the output using tidyverse. Get the distinct elements of the 'User_Name' column and loop through them (map). For each one, filter the rows of the dataset where 'User_Name' matches, paste together the groups that are missing from the filtered 'Group' values (compared with all unique groups), subset the first row (slice), and right_join with the original dataset. map_df is used so the end output is a single data.frame instead of a list of data.frames.
library(tidyverse)
df1 %>%
  distinct(User_Name) %>%
  pull(User_Name) %>%
  map_df(~ df1 %>%
           filter(User_Name == .x) %>%
           mutate(Flag = toString(setdiff(unique(df1$Group),
                                          unique(Group)))) %>%
           slice(1) %>%
           select(-Group)) %>%
  right_join(df1, "User_Name")
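The same flag can also be computed without looping over users, e.g. with a grouped mutate (a sketch on the same df1; the output column name Flag matches the answer above):
library(dplyr)
df1 %>%
  group_by(User_Name) %>%
  mutate(Flag = toString(setdiff(unique(df1$Group), Group))) %>%
  ungroup()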
I'm new to programming in R, and I've been having this problem for several days now. I started with a list that I created by splitting a file. Each element of this list contains a number of items in a single row.
> head(sales2)
$`7143443`
[1] "SSS-U-CCXVCSS1"   "L-CCX-8GETTS-LIC"

$`7208993`
[1] "NFFGSR4="  "1MV-FT-1=" "VI-NT/TE="

$`7241758`
[1] "PW_SQSGG="

$`9273628`
[1] "O1941-SE9"   "CCO887VA-K9" "2901-SEC/K9" "CO1941-C/K9"

$`9371709`
[1] "HGR__SASS="   "WWQTTB0S-L"   "WS-RRRT48FP"  "WTTTF24PS-L"
[5] "GEDQTT8TS-L"  "WD-SRNS-2S-L"

$`9830473`
[1] "SPA$FFSB0S"
I wanted to convert it into a data frame, so I used
x<-do.call(rbind, lapply(sales2,data.frame))
It gets converted to a data frame, but it looks like this:
> head(x, 6)
                          id
7143443.1   "SSS-U-CCXVCSS1"
7143443.2 "L-CCX-8GETTS-LIC"
7208993.1         "NFFGSR4="
7208993.2        "1MV-FT-1="
7208993.3        "VI-NT/TE="
7241758          "PW_SQSGG="
I want all of 7143443's items in a single row, not in multiple rows.
From this I want to calculate how many rows contain 2 items together;
for example, "WS-C2960S-48TS-L" and "WS-C2960S-24TS-L":
in how many rows are these 2 elements present together?
You could also say: the probability of these over all other elements.
I am not sure what your final desired output is. But the following script can convert your list to a data frame. Perhaps you can begin your analysis from this data frame.
# Create example list
sales2 <- list(`7143443` = c("SSS-U-CCXVCSS1", "L-CCX-8GETTS-LIC"),
               `7208993` = c("NFFGSR4=", "1MV-FT-1=", "VI-NT/TE="),
               `7241758` = "PW_SQSGG=",
               `9273628` = c("O1941-SE9", "CCO887VA-K9", "2901-SEC/K9", "CO1941-C/K9"),
               `9371709` = c("HGR__SASS=", "WWQTTB0S-L", "WS-RRRT48FP", "WTTTF24PS-L",
                             "GEDQTT8TS-L", "WD-SRNS-2S-L"),
               `9830473` = "SPA$FFSB0S")
# Load packages
library(dplyr)
library(purrr)
dat <- map(sales2, ~ tibble(Value = .x)) %>%       # Convert each list element to a one-column tibble
  bind_rows(.id = "ID") %>%                        # Combine all data frames
  group_by(ID) %>%                                 # Group by the ID column
  summarise(Value = paste0(Value, collapse = " ")) # Collapse the Value column
dat
# A tibble: 6 × 2
ID Value
<chr> <chr>
1 7143443 SSS-U-CCXVCSS1 L-CCX-8GETTS-LIC
2 7208993 NFFGSR4= 1MV-FT-1= VI-NT/TE=
3 7241758 PW_SQSGG=
4 9273628 O1941-SE9 CCO887VA-K9 2901-SEC/K9 CO1941-C/K9
5 9371709 HGR__SASS= WWQTTB0S-L WS-RRRT48FP WTTTF24PS-L GEDQTT8TS-L WD-SRNS-2S-L
6 9830473 SPA$FFSB0S
Update
After reading the original poster's comment, I decided to update my solution to count how many rows contain two specified string patterns.
Here one row is a unique ID. So I assume that the request can be rephrased as "How many IDs contain two specified string patterns?" If this is the case, I would prefer not to collapse all the observations. After collapsing all observations to form one ID per row, we would need to develop a strategy to match the strings, such as using regular expressions. I am not familiar with regular expressions, so I will leave this for others to provide solutions.
In addition, the original poster did not specify which two strings are targeted, so I will develop a strategy where users can replace the targeted strings case by case.
dat <- map(sales2, ~ tibble(Value = .x)) %>% # Convert each list element to a one-column tibble
  bind_rows(.id = "ID")                      # Combine all data frames
# After this, there is no need to collapse the rows
# Set the target strings; users can change the strings here
target_string1 <- c("SSS-U-CCXVCSS1", "L-CCX-8GETTS-LIC")
dat2 <- dat %>%
  filter(Value %in% target_string1) %>%     # Keep rows matching the targeted strings
  distinct(ID, Value, .keep_all = TRUE) %>% # Keep one row if ID and Value are exact duplicates
  count(ID) %>%                             # Count how many rows per ID
  filter(n > 1) %>%                         # Keep only IDs whose count is larger than 1
  select(ID)
dat2
# A tibble: 1 × 1
ID
<chr>
1 7143443
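As for the original "how many rows contain these 2 items together" question, a sketch that works directly on the sales2 list (the two item names come from the question; against the scrambled example data above the count may simply be 0):
library(purrr)
targets <- c("WS-C2960S-48TS-L", "WS-C2960S-24TS-L")
# number of IDs whose item vector contains both targets
sum(map_lgl(sales2, ~ all(targets %in% .x)))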