I have a data frame which is configured roughly like this:
df <- cbind(c('hello', 'yes', 'example'),c(7,8,5),c(0,0,0))
words    frequency  count
hello    7          0
yes      8          0
example  5          0
What I'm trying to do is add values to the third column from a different data frame, which is similar but looks like this:
df2 <- cbind(c('example','hello') ,c(5,6))
words    frequency
example  5
hello    6
My goal is to find matching values for the first column in both data frames (they have the same column name) and add matching values from the second data frame to the third column of the first data frame.
The result should look like this:
df <- cbind(c('hello', 'yes', 'example'),c(7,8,5),c(6,0,5))
words    frequency  count
hello    7          6
yes      8          0
example  5          5
What I've tried so far is:
df <- merge(df,df2, by = "words", all.x=TRUE)
However, it doesn't work.
I could use some help understanding how this could be done. Any help is welcome.
This is an "update join". My favorite way to do it is in dplyr:
library(dplyr)
df %>% rows_update(rename(df2, count = frequency), by = "words")
In base R you could do the same thing like this:
names(df2)[2] = "count2"
df = merge(df, df2, by = "words", all.x=TRUE)
df$count = ifelse(is.na(df$count2), df$count, df$count2)
df$count2 = NULL
Here is an option with data.table:
library(data.table)
setDT(df)[setDT(df2), on = "words", count := i.frequency]
Output
words frequency count
<char> <num> <num>
1: hello 7 6
2: yes 8 0
3: example 5 5
Or using match in base R:
df$count[match(df2$words, df$words)] <- df2$frequency
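The match one-liner above assumes every word in df2 also occurs in df. A minimal base R sketch of a variant that tolerates unmatched words, by matching in the other direction and skipping the NA positions, might look like this. The extra word "missing" and the helper names idx and hit are made up for illustration and are not part of the original data.

```r
# Same data as in the question, plus one hypothetical word that is absent from df
df  <- data.frame(words = c("hello", "yes", "example"),
                  frequency = c(7, 8, 5),
                  count = c(0, 0, 0))
df2 <- data.frame(words = c("example", "hello", "missing"),
                  frequency = c(5, 6, 9))

# For each word in df, its row position in df2 (NA when df2 has no such word)
idx <- match(df$words, df2$words)
hit <- !is.na(idx)

# Update only the rows of df that found a partner in df2
df$count[hit] <- df2$frequency[idx[hit]]
df
```

Rows of df without a partner keep their existing count, and words that only occur in df2 are ignored.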
Or another option with tidyverse, using left_join and pmax:
library(tidyverse)
left_join(df, df2 %>% rename(count.y = frequency), by = "words") %>%
  mutate(count = pmax(count.y, count, na.rm = TRUE)) %>%
  select(-count.y)
Data
df <- structure(list(words = c("hello", "yes", "example"), frequency = c(7,
8, 5), count = c(0, 0, 0)), class = "data.frame", row.names = c(NA,
-3L))
df2 <- structure(list(words = c("example", "hello"), frequency = c(5, 6)), class = "data.frame", row.names = c(NA,
-2L))
I have some raw data from an instrument that I need to get into shape, so I can work with it. I guess it's an easy fix, but I need some help.
The df looks like this:
date <-c("Date1","Date2","Date3")
data <-c("1,234,567,2345;2,345,5677,256;3,576,345,3456", "1,564,567,2345;2,745,5677,256;3,577,345,8456", "1,234,567,2345;2,345,5677,256,;3,555,345,3456;....")
df<-data.frame(date, data)
So for every date there are size classes (1,2 or 3..) with three corresponding measurements/values, separated by a ",". The different size classes are separated by a ";".
Now I would like to access the first value per size class and transfer it into a new df. It should look like this:
date<-c("Date1","Date2","Date3")
data_sizeclass_1<-c(234,564,234)
data_sizeclass_2<-c(345,745,345)
data_sizeclass_3<-c(576,577,555)
df<-data.frame(date,data_sizeclass_1,data_sizeclass_2,data_sizeclass_3)
I hope that it makes sense.
So far I managed to separate the data column into separate columns (with cSplit). I would be able to assemble the columns into a new df, but then I would have to choose each column manually, and since I have more than 200 size classes that would be a lot of work. I would like to find a solution to directly extract only one of the measurements into a new df.
Thanks for your help.
Does this work?
library(tidyr)
library(dplyr)
library(stringr)
df %>%
  # first get rid of digit at string beginning and, respectively, after semi-colon:
  mutate(data = gsub("^\\d,|(?<=;)\\d,", "", data, perl = TRUE)) %>%
  # separate into columns using semi-colon as separator:
  separate(data,
           into = paste0("data_sizeclass_", 1:3),
           sep = ";",
           convert = TRUE) %>%
  # finally extract the initial numbers in the new columns:
  mutate(across(c(-date), ~ str_extract(., "\\d+")))
date data_sizeclass_1 data_sizeclass_2 data_sizeclass_3
1 Date1 234 345 576
2 Date2 564 745 577
3 Date3 234 345 555
Maybe you can try the base R code below
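# for each row, take the first number that follows a comma in every size class
vals <- t(sapply(
  strsplit(df$data, ";"),
  function(x) as.numeric(regmatches(x, regexpr("(?<=,)\\d+", x, perl = TRUE)))
))
# name the columns by the number of size classes (not by the number of rows)
colnames(vals) <- paste0("data_sizeclass_", seq_len(ncol(vals)))
cbind(df, vals)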
which gives
date data data_sizeclass_1
1 Date1 1,234,567,2345;2,345,5677,256;3,576,345,3456 234
2 Date2 1,564,567,2345;2,745,5677,256;3,577,345,8456 564
3 Date3 1,234,567,2345;2,345,5677,256,;3,555,345,3456;.... 234
data_sizeclass_2 data_sizeclass_3
1 345 576
2 745 577
3 345 555
library(data.table)
#set df to data.table
dt <- as.data.table(df)
#find the amount of columns which need to be created (max nr of different sizeclasses)
colnr <- max(lengths(strsplit(df$data, ";")))
#create columns for each sizeclass
for(i in 1:colnr ){
dt[, paste0("data_sizeclass_", i) := sapply(strsplit(data, ";"), `[`, i)]
}
dt
date data data_sizeclass_1 data_sizeclass_2 data_sizeclass_3
1: Date1 1,234,567,2345;2,345,5677,256;3,576,345,3456; 1,234,567,2345 2,345,5677,256 3,576,345,3456
2: Date2 1,564,567,2345;2,745,5677,256;3,577,345,8456; 1,564,567,2345 2,745,5677,256 3,577,345,8456
3: Date3 1,234,567,2345;2,345,5677,256,;3,555,345,3456; 1,234,567,2345 2,345,5677,256, 3,555,345,3456
In the second part we use strsplit to extract the second element of data
#pivot longer
dt <- melt(dt, measure.vars = paste0("data_sizeclass_", 1:colnr))
#extract second element of strsplit, using , as separator
dt[, value := sapply(strsplit(value, ","), `[`, 2)]
#pivot wider (to original shape)
dt <- dcast(dt, date+data ~ variable, value.var = "value")
dt
date data data_sizeclass_1 data_sizeclass_2 data_sizeclass_3
1: Date1 1,234,567,2345;2,345,5677,256;3,576,345,3456; 234 345 576
2: Date2 1,564,567,2345;2,745,5677,256;3,577,345,8456; 564 745 577
3: Date3 1,234,567,2345;2,345,5677,256,;3,555,345,3456; 234 345 555
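For more than 200 size classes, the same idea can be sketched in base R without naming any column by hand: split on ";", take the second comma-separated field of each group, and pad shorter rows with NA. This is only an illustration using the first two rows of the example data; first_values, n_classes, padded and result are names chosen here, not part of the original answers.

```r
date <- c("Date1", "Date2")
data <- c("1,234,567,2345;2,345,5677,256;3,576,345,3456",
          "1,564,567,2345;2,745,5677,256;3,577,345,8456")

# One numeric vector per row: the first measurement of every size class
first_values <- lapply(strsplit(data, ";", fixed = TRUE), function(groups) {
  fields <- strsplit(groups, ",", fixed = TRUE)
  # field 1 is the size class label, field 2 is the first measurement
  as.numeric(vapply(fields, function(f) f[2], character(1)))
})

# Pad to the largest number of size classes so the rows can be stacked
n_classes <- max(lengths(first_values))
padded <- lapply(first_values,
                 function(v) c(v, rep(NA_real_, n_classes - length(v))))

mat <- do.call(rbind, padded)
colnames(mat) <- paste0("data_sizeclass_", seq_len(n_classes))
result <- data.frame(date = date, mat)
result
```

Changing the index 2 to 3 or 4 would pick the second or third measurement instead.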
I have eight data frames, all of which contain an id field, and I want to know whether any of the id values appear in more than one of them.
I'm not looking for an intersection (where the values are common across all data frames); I simply want to know those instances where they appear in any of the other data frames.
Let's say that one of the data frames looks like this:
id TestDay
1 66 m
2 90 t
3 71 w
4 59 th
5 38 f
6 84 sa
7 15 su
8 89 m
9 18 t
10 93 w
11 88 th
12 42 f
13 10 sa
14 33 su
15 49 m
16 51 t
17 80 w
18 32 th
19 1 f
20 91 sa
21 58 su
If you wish to create eight sample data frames, you can do so by using this code eight times (with different data frame names, naturally):
x <- data.frame(id = sample(1:100, 21, FALSE), TestDay = rep(c("m","t","w","th","f","sa","su"), 3))
I want to know if any of the id values listed here appear in any of the other seven data frames, and conversely, whether any of the id values listed in any of the other seven data frames exist in this one.
How can this be done?
Combine all the dataframes in one dataframe with a unique id value which will distinguish each dataframe.
I created two dataframes here with data column representing the dataframe number.
library(dplyr)
x1 <- data.frame(id = round(runif(21, 1, 21)), TestDay = rep(c("m","t","w","th","f","sa","su"), 3))
x2 <- data.frame(id = round(runif(21, 1, 21)), TestDay = rep(c("m","t","w","th","f","sa","su"), 3))
combine_data <- bind_rows(x1, x2, .id = 'data')
Group by the id column and count how many dataframes each id is present in.
combine_data %>%
group_by(id) %>%
summarise(count_unique = n_distinct(data))
You can add filter(count_unique > 1) to the above chain to get id's which are present in more than 1 dataframe.
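The same count can be sketched in base R when the data frames are kept in a list. The three small frames below are made up for illustration (the question has eight), and frames, ids, counts and shared are names chosen here.

```r
# Hypothetical stand-ins for the eight data frames
frames <- list(
  a = data.frame(id = c(1, 2, 3)),
  b = data.frame(id = c(3, 4)),
  c = data.frame(id = c(4, 5, 1))
)

# Take each id once per data frame, so duplicates within a frame do not inflate the count
ids <- unlist(lapply(frames, function(d) unique(d$id)))

# Number of data frames each id occurs in
counts <- table(ids)

# ids present in more than one data frame
shared <- as.numeric(names(counts)[as.vector(counts) > 1])
shared
```

Here counts plays the role of count_unique, and shared corresponds to the filter(count_unique > 1) step.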
To add to @Ronak's answer, you can also collect the dataframe numbers with c() inside summarise(). This tells you which dataframes each id comes from.
df1 <- data.frame(id = letters[1:3])
df2 <- data.frame(id = letters[4:6])
df3 <- data.frame(id = letters[5:10])
library(tidyverse)
df <- list(df1, df2, df3)
df4 <- df %>%
bind_rows(.id = "df_num") %>%
mutate(df_num = as.integer(df_num)) %>%
group_by(id) %>%
summarise(
df_found = list(c(df_num)),
df_n = map_int(df_found, length)
)
df4
We can use data.table methods
Get the datasets in a list and rbind them with rbindlist
Grouped by 'id' get the count of unique 'data' with uniqueN
library(data.table)
rbindlist(list(x1, x2), idcol = 'data')[, .(count_unique = uniqueN(data)), by = id]
For example, I have a data table with several columns:
column A                 column B
key_500:station and loc  2
spectra:key_600:type     9
alpha:key_100:number     12
I want to split the rows of column A into components and create new columns, guided by the following rules:
the value between "key_" and ":" will be var1,
the next value after ":" will be var2,
the original column A should retain the part of string that is prior to ":key_". If it is empty (as in the first line), then replace "" with an "effect" word.
My expected final data table should be like this one:
column A  column B  var1  var2
effect    2         500   station and loc
spectra   9         600   type
alpha     12        100   number
Using tidyr's extract, you can pull out specific parts of the string with a regex.
library(dplyr) # provides %>%
tidyr::extract(df, columnA, into = c('var1', 'var2'), 'key_(\\d+):(.*)',
               convert = TRUE, remove = FALSE) %>%
  dplyr::mutate(columnA = sub(':?key_.*', '', columnA),
                columnA = replace(columnA, columnA == '', 'effect'))
# columnA var1 var2 columnB
#1 effect 500 station and loc 2
#2 spectra 600 type 9
#3 alpha 100 number 12
If you want to use data.table you can break this down in steps :
library(data.table)
setDT(df)
df[, c('var1', 'var2') := .(sub('.*key_(\\d+).*', '\\1',columnA),
sub('.*key_\\d+:', '', columnA))]
df[, columnA := sub(':?key_.*', '', columnA)]
df[, columnA := replace(columnA, columnA == '', 'effect')]
data
df <- structure(list(columnA = c("key_500:station and loc",
"spectra:key_600:type", "alpha:key_100:number"),
columnB = c(2L, 9L, 12L)), class = "data.frame", row.names = c(NA, -3L))
You can use separate, which by default splits on runs of non-alphanumeric characters and puts the pieces into the columns named in into:
require(tidyr)
require(dplyr)
df=tribble(
~"column A",~"column B",
"key_500:station", 2,
"spectra:key_600:type", 9,
"alpha:key_100:number", 12)
df %>% separate("column A",into=c('column A','key','var1','var2'),fill='left') %>% select(-key) %>% select("column A","column B",var1,var2) %>%
mutate(`column A`=ifelse(is.na(`column A`),"effect",`column A`))
And this is a modified version to work with data.tables
require(tidyr)
require(data.table)
DT=data.table(
"column A"=
c("key_500:station and loc",
"spectra:key_600:type",
"alpha:key_100:number"),
"column B"=c(2,9,12))
DT=separate(sep = "[^[:alnum:] ]+",DT,"column A",into=c('column A','key','var1','var2'),fill='left')
DT$key=NULL
DT$`column A`=ifelse(is.na(DT$`column A`),"effect",DT$`column A`)
DT=DT[,c(1,4,2,3)]
I have two data frames.
set.seed(1234)
df <- data.frame(
id = factor(rep(1:24, each = 10)),
price = runif(20)*100,
quantity = sample(1:100,240, replace = T)
)
df2 <- data.frame(
id = factor(seq(1:24)),
eq.quantity = sample(1:100, 24, replace = T)
)
For each id, I would like to find the row of df whose quantity is closest (in absolute difference) to the matching df2$eq.quantity. I would like to do that for every id in df2 and bind the rows into a new data frame called results.
I can do it like this for each individual id:
d.1 <- df2[df2$id == 1, 2]
df.1 <- subset(df, id == 1)
id.1 <- df.1[which.min(abs(df.1$quantity-d.1)),]
Which would give the solution:
id price quantity
1 66.60838 84
But I would really like a smarter solution that also gathers the results into a data frame. If I did it manually, it would look something like this:
results <- cbind(id.1, id.2, etc..., id.24)
I had some trouble giving this question a good name.
data.tables are smart!
Adding this to your current example...
library(data.table)
dt = data.table(df)
dt2 = data.table(df2)
setkey(dt, id)
setkey(dt2, id)
dt[dt2, dif:=abs(quantity - eq.quantity)]
dt[,list(price=price[which.min(dif)], quantity=quantity[which.min(dif)]), by=id]
result:
dt[,list(price=price[which.min(dif)], quantity=quantity[which.min(dif)]), by=id]
id price quantity
1: 1 66.6083758 84
2: 2 29.2315840 19
3: 3 62.3379442 63
4: 4 54.4974836 31
5: 5 66.6083758 6
6: 6 69.3591292 13
...
Merge the two datasets and use lapply to perform the function on each id.
df3 <- merge(df,df2,all.x=TRUE,by="id")
diffvar <- function(df){
df4 <- subset(df3, id == df)
df4[which.min(abs(df4$quantity-df4$eq.quantity)),]
}
resultslist <- lapply(levels(df3$id),function(df) diffvar(df))
Combine the resulting list elements in a dataframe:
resultsdf <- do.call(rbind, resultslist)
Or, more simply:
library(plyr)
resultsdf <- ddply(df3, .(id), function(x)x[which.min(abs(x$quantity-x$eq.quantity)),])
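For reference, the merge-then-pick-the-minimum idea can also be sketched in base R alone with split and do.call(rbind, ...). The small data frames below are made up so the result is easy to check by hand; merged and results are names chosen for this sketch.

```r
# Hypothetical miniature version of df and df2
df <- data.frame(id = rep(1:2, each = 3),
                 price = c(10, 20, 30, 40, 50, 60),
                 quantity = c(5, 50, 95, 10, 20, 30))
df2 <- data.frame(id = 1:2, eq.quantity = c(60, 12))

# Attach eq.quantity to every row, then compute the absolute difference
merged <- merge(df, df2, by = "id")
merged$dif <- abs(merged$quantity - merged$eq.quantity)

# Within each id, keep the row with the smallest difference and stack the rows
results <- do.call(rbind, lapply(split(merged, merged$id), function(g) {
  g[which.min(g$dif), ]
}))
results
```

For id 1 the closest quantity to 60 is 50, and for id 2 the closest to 12 is 10, so results holds those two rows. Note that which.min returns the first minimum, so ties go to the earlier row.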