I have two big datasets, and I would like to subset one of them in order to use the data.
My problem is that the reference column for subsetting does not match completely, so I would like to match on the part of the strings that is the same.
Here is a simpler example:
ref_df <- data.frame("reference" = c("swietenia macrophylla",
"azadirachta indica",
"cedrela odorata",
"ochroma pyramidale",
"tectona grandis",
"tamarindus indica",
"cariniana pyriformis",
"paquita quinata",
"albizia saman",
"enterolobium cyclocarpum",
"tapirira guianensis",
"dipteryx oleifera"),
"values" = c(rnorm(12)))
tofind_df <- c("swietenia macrophylla and try try",
"azadirachta indica",
"tamarindus indica (bla bla)",
"tara",
"bla bla (paquita quinata)",
"prosopis pallida",
"dipteryx oleifera")
So I try to keep all the rows of ref_df whose name matches, even partially, an entry in tofind_df, but it only matches when the strings are identical.
finale <- ref_df[ref_df$reference %in% tofind_df, ]
I tried with grepl as well, but I couldn't find the solution.
My ideal finale should look like this:
reference values
1 swietenia macrophylla -0.459001383
2 azadirachta indica -0.430014486
3 tamarindus indica -0.541887328
4 paquita quinata -0.003572792
5 dipteryx oleifera -0.855659901
Please keep in mind that the real data are two big data frames, not this simple case.
We need sapply to run grepl for every element of the reference column:
ref_df[sapply(ref_df$reference, function(x) any(grepl(x, tofind_df))),]
reference values
1 swietenia macrophylla 1.4482830
2 azadirachta indica 0.9037943
6 tamarindus indica -0.2994678
8 paquita quinata 0.4895183
12 dipteryx oleifera -1.1652528
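Note that grepl treats each reference name as a regular expression. The following is a minimal sketch, assuming the names should be matched as literal text; the toy data (including the made-up entry "tara (x)") is only for illustration. Passing fixed = TRUE keeps characters such as parentheses from being read as regex syntax, and vapply guarantees a logical result.

```r
# Toy data, invented for this sketch (not the asker's data)
ref_df <- data.frame(
  reference = c("swietenia macrophylla", "tara (x)", "cedrela odorata"),
  values = c(1, 2, 3),
  stringsAsFactors = FALSE
)
tofind <- c("swietenia macrophylla and try try", "bla tara (x) bla")

# For each reference name, check whether it occurs literally
# inside any of the strings to search
keep <- vapply(
  ref_df$reference,
  function(x) any(grepl(x, tofind, fixed = TRUE)),
  logical(1)
)

finale <- ref_df[keep, ]
print(finale)
```

Without fixed = TRUE, the pattern "tara (x)" would be read as a regex with a capture group and would not match the literal text "tara (x)".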
You can use group_by and filter from dplyr and str_detect from stringr:
library(dplyr)
library(stringr)
ref_df %>%
mutate(reference = as.character(reference))%>% #reference is factor. Making it a character
group_by(reference) %>%
filter(any(str_detect(tofind_df,reference)))%>% #Finding if there are any matches between each reference name and any of the strings in the tofind_df
ungroup()
# A tibble: 5 x 2
reference values
<chr> <dbl>
1 swietenia macrophylla -0.456
2 azadirachta indica -1.08
3 tamarindus indica -0.428
4 paquita quinata -0.937
5 dipteryx oleifera 0.816
I'm dealing with a quite complicated data frame. I'm trying to rename its columns by adding some terminal digits to each column several times. How can I do that?
Let me give an example:
df=data.frame(disease=c("HB1","HB2","HB3","HB4"),
region="AZ",
hospitalAZ=runif(4),
hospitalAZ=runif(4),
hospitalAZ=runif(4),
hospitalAZ=runif(4))
This is just a toy example. The outcome should be: the columns after "region" should be named HospitalAZ1, HospitalAZ2, HospitalAZ1, HospitalAZ2. I am looking for a parsimonious way of adding, in this case, 1 and 2 to the 4 columns with repetition (2 times in this case). Then, how can I export the outcome to an xls file?
We could use rename_with
library(dplyr)
library(stringr)
df <- df %>%
rename_with(~ make.unique(str_c("HospitalAZ", rep(1:2,
length.out = length(.x)))), starts_with("hospitalAZ"))
-output
disease region HospitalAZ1 HospitalAZ2 HospitalAZ1.1 HospitalAZ2.1
1 HB1 AZ 0.1796734 0.28729264 0.8549300 0.8486733
2 HB2 AZ 0.8518319 0.03438504 0.5909983 0.8378173
3 HB3 AZ 0.3961885 0.67294967 0.4627137 0.5484321
4 HB4 AZ 0.9955195 0.38767387 0.1961428 0.6010028
NOTE: a matrix can have duplicate column names, but duplicate column names in a data.frame are not recommended, and in the tidyverse duplicates can result in an error. That is why make.unique is applied above.
In base R, we may do
i1 <- startsWith(names(df), "hospitalAZ")
names(df)[i1] <- paste0("HospitalAZ", rep(1:2, length.out = sum(i1)))
> df
disease region HospitalAZ1 HospitalAZ2 HospitalAZ1 HospitalAZ2
1 HB1 AZ 0.1796734 0.28729264 0.8549300 0.8486733
2 HB2 AZ 0.8518319 0.03438504 0.5909983 0.8378173
3 HB3 AZ 0.3961885 0.67294967 0.4627137 0.5484321
4 HB4 AZ 0.9955195 0.38767387 0.1961428 0.6010028
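The question also asks how to export the result. Here is a minimal sketch, assuming a CSV file (which Excel can open) is acceptable; out_file is a hypothetical temporary path chosen for illustration. For a native .xlsx file, a package such as writexl (its write_xlsx function) is one option, not shown here.

```r
set.seed(1)

# Small version of the example data; data.frame() de-duplicates the
# repeated names to hospitalAZ, hospitalAZ.1, ... on creation
df <- data.frame(disease = c("HB1", "HB2"),
                 region = "AZ",
                 hospitalAZ = runif(2),
                 hospitalAZ = runif(2),
                 hospitalAZ = runif(2),
                 hospitalAZ = runif(2))

# Rename the hospital columns with the repeating suffixes 1, 2, 1, 2
i1 <- startsWith(names(df), "hospitalAZ")
names(df)[i1] <- paste0("HospitalAZ", rep(1:2, length.out = sum(i1)))

# Write to a CSV file; out_file is a hypothetical path
out_file <- tempfile(fileext = ".csv")
write.csv(df, out_file, row.names = FALSE)
```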
A relatively trivial question that has been bothering me for a while, but to which I have not yet found an answer - perhaps because I have trouble verbalizing the problem for search engines.
Here is a column of a data frame that contains identifiers.
data <- data.frame("id" = c("D78", "L30", "F02", "A23", "B45", "T01", "Q38", "S30", "K84", "O04", "P12", "Z33"))
Based on a lookup table, outdated identifiers are to be recoded into new ones. Here is an example look up table.
recode_table <- data.frame("old" = c("A23", "B45", "K84", "Z33"),
"new" = c("A24", "B46", "K88", "Z33"))
What I need now can be done with a merge or a loop. Here is a loop example:
for(ID in recode_table$old) {
data[data$id == ID, "id"] <- recode_table[recode_table$old == ID, "new"]
}
But I am looking for a dplyr solution that does not use the "join" family. I would like something like this:
data <- mutate(data, id = ifelse(id %in% recode_table$old, filter(recode_table, old == id) %>% pull(new), id))
Obviously though, I can't use the column name ("id") of the table in order to identify the new ID.
References to corresponding passages in documentations or manuals are also appreciated. Thanks in advance!
You can use recode with unquote splicing (!!!) on a named vector
library(dplyr)
# vector of new IDs
recode_vec <- recode_table$new
# named with old IDs
names(recode_vec) <- recode_table$old
data %>%
mutate(id = recode(id, !!!recode_vec))
# id
# 1 D78
# 2 L30
# 3 F02
# 4 A24
# 5 B46
# 6 T01
# 7 Q38
# 8 S30
# 9 K88
# 10 O04
# 11 P12
# 12 Z33
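As an alternative without recode or a join, the lookup can be sketched with base R's match. This is a swapped-in technique, not part of the answer above; the shortened data is only for illustration, and idx is a helper name introduced here. The same expression could also be placed inside mutate.

```r
data <- data.frame(id = c("D78", "A23", "B45", "Z33"),
                   stringsAsFactors = FALSE)
recode_table <- data.frame(old = c("A23", "B45", "K84", "Z33"),
                           new = c("A24", "B46", "K88", "Z33"),
                           stringsAsFactors = FALSE)

# Position of each id in the lookup table (NA when the id is not listed)
idx <- match(data$id, recode_table$old)

# Keep the original id where there is no match, otherwise take the new one
data$id <- ifelse(is.na(idx), data$id, recode_table$new[idx])
print(data)
```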
I got an XLSX with data from a questionnaire for my master thesis.
The questions and answers for an interviewee are in one row in the second column. The first column contains the date.
The data of the second column comes in a form like this:
"age":"52","height":"170","Gender":"Female",...and so on
I started with:
test12 <- read_xlsx("Testdaten.xlsx")
library(splitstackshape)
test13 <- concat.split(data = test12, split.col= "age", sep =",")
Then I got the questions and the answers in columns, separated by a ":".
For example, column 1: "age":"52" and column 2: "height":"170".
But the data is so messy that sometimes the age column contains a height question and answer, and for some questionnaires the questions and answers are duplicated.
I need the questions as variables and the answers as observations, but I have no clue how to get there. I could clean the data in Excel first, but because the columns are not constant (for example, some height questions appear in the age column) and I will regularly get new data formatted the same way, I see no chance of doing it by hand.
Here is an example of the data:
A tibble: 5 x 2
partner.createdAt partner.wphg.info
<chr> <chr>
1 2019-11-09T12:13:11.099Z "{\"age_years\":\"50\",\"job_des\":\"unemployed\",\"height_cm\":\"170\",\"Gender\":\"female\",\"born_in\":\"Italy\",\"Alcoholic\":\"false\",\"knowledge_selfass\":\"5\",\"total_wealth\":\"200000\""
2 2019-11-01T06:43:22.581Z "{\"age_years\":\"34\",\"job_des\":\"self-employed\",\"height_cm\":\"158\",\"Gender\":\"male\",\"born_in\":\"Germany\",\"Alcoholic\":\"true\",\"knowledge_selfass\":\"3\",\"total_wealth\":\"10000\""
3 2019-11-10T07:59:46.136Z "{\"age_years\":\"24\",\"height_cm\":\"187\",\"Gender\":\"male\",\"born_in\":\"England\",\"Alcoholic\":\"false\",\"knowledge_selfass\":\"3\",\"total_wealth\":\"150000\""
4 2019-11-11T13:01:48.488Z "{\"age_years\":\"59\",\"job_des\":\"employed\",\"height_cm\":\"167\",\"Gender\":\"female\",\"born_in\":\"United States\",\"Alcoholic\":\"false\",\"knowledge_selfass\":\"2\",\"total_wealth\":\"1000000~
5 2019-11-08T14:54:26.654Z "{\"age_years\":\"36\",\"height_cm\":\"180\",\"born_in\":\"Germany\",\"Alcoholic\":\"false\",\"knowledge_selfass\":\"5\",\"total_wealth\":\"170000\",\"job_des\":\"employed\",\"Gender\":\"male\""
Thank you so much for your time!
You can loop through each entry, splitting at , as you did. Then you can loop through them all again, splitting at :.
The result will be a bunch of variable/value pairings. This can be all done stacked. Then you just want to pivot back into columns.
data
Updated the data based on your edit.
data <- tribble(~partner.createdAt, ~partner.wphg.info,
'2019-11-09T12:13:11.099Z', '{\"age_years\":\"50\",\"job_des\":\"unemployed\",\"height_cm\":\"170\",\"Gender\":\"female\",\"born_in\":\"Italy\",\"Alcoholic\":\"false\",\"knowledge_selfass\":\"5\",\"total_wealth\":\"200000\"',
'2019-11-01T06:43:22.581Z', '{\"age_years\":\"34\",\"job_des\":\"self-employed\",\"height_cm\":\"158\",\"Gender\":\"male\",\"born_in\":\"Germany\",\"Alcoholic\":\"true\",\"knowledge_selfass\":\"3\",\"total_wealth\":\"10000\"',
'2019-11-10T07:59:46.136Z', '{\"age_years\":\"24\",\"height_cm\":\"187\",\"Gender\":\"male\",\"born_in\":\"England\",\"Alcoholic\":\"false\",\"knowledge_selfass\":\"3\",\"total_wealth\":\"150000\"',
'2019-11-11T13:01:48.488Z', '{\"age_years\":\"59\",\"job_des\":\"employed\",\"height_cm\":\"167\",\"Gender\":\"female\",\"born_in\":\"United States\",\"Alcoholic\":\"false\",\"knowledge_selfass\":\"2\",\"total_wealth\":\"1000000\"',
'2019-11-08T14:54:26.654Z', '{\"age_years\":\"36\",\"height_cm\":\"180\",\"born_in\":\"Germany\",\"Alcoholic\":\"false\",\"knowledge_selfass\":\"5\",\"total_wealth\":\"170000\",\"job_des\":\"employed\",\"Gender\":\"male\"')
libraries
We need a few here. Or you can just call tidyverse.
library(stringr)
library(purrr)
library(dplyr)
library(tibble)
library(tidyr)
function
This function creates a data frame (tibble) for one record, with one row per question. The first column is the date, the second is the variable, the third is the value.
clean_record <- function(date, text) {
clean_records <- str_split(text, pattern = ",", simplify = TRUE) %>%
str_remove_all(pattern = "\\\"") %>% # remove double quote
str_remove_all(pattern = "\\{|\\}") %>% # remove curly brackets
str_split(pattern = ":", simplify = TRUE)
tibble(date = as.Date(date), variable = clean_records[,1], value = clean_records[,2])
}
iteration
Now we use pmap_dfr from purrr to loop over the rows, outputting each row with an id variable named record.
This will stack the data as described in the function. The mutate() line converts all variable names to lowercase. The distinct() line will filter out rows that are exact duplicates.
What we do then is just pivot on the variable column. Of course, replace data with whatever you name your data frame.
data_clean <- pmap_dfr(data, ~ clean_record(..1, ..2), .id = "record") %>%
mutate(variable = tolower(variable)) %>%
distinct() %>%
pivot_wider(names_from = variable, values_from = value)
result
The result is something like this. Note that I had reordered some of the keys in the last record, and it still works. You are probably not done just yet: all columns are now of type character, so you need to decide on the desired type for each and convert it.
# A tibble: 5 x 10
record date age_years job_des height_cm gender born_in alcoholic knowledge_selfass total_wealth
<chr> <date> <chr> <chr> <chr> <chr> <chr> <chr> <chr> <chr>
1 1 2019-11-09 50 unemployed 170 female Italy false 5 200000
2 2 2019-11-01 34 self-employed 158 male Germany true 3 10000
3 3 2019-11-10 24 NA 187 male England false 3 150000
4 4 2019-11-11 59 employed 167 female United States false 2 1000000
5 5 2019-11-08 36 employed 180 male Germany false 5 170000
For example, convert age_years to numeric.
data_clean %>%
mutate(age_years = as.numeric(age_years))
I am sure you may run into other things, but this should be a start.
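One thing to watch for: splitting on "," and ":" breaks if an answer itself contains a comma or a colon. The following is a rough base R sketch of an alternative, assuming every entry has the form "key":"value" as in the sample; parse_pairs is a hypothetical helper name and the input string is shortened for illustration.

```r
# Extract all "key":"value" pairs from one record with a regular expression
parse_pairs <- function(text) {
  # Each match is one complete "key":"value" pair
  m <- regmatches(text, gregexpr('"[^"]*":"[^"]*"', text))[[1]]
  keys <- sub('^"([^"]*)":"([^"]*)"$', "\\1", m)
  vals <- sub('^"([^"]*)":"([^"]*)"$', "\\2", m)
  # Lowercase the keys so that e.g. Gender and gender end up in one column
  setNames(vals, tolower(keys))
}

x <- '{"age_years":"50","job_des":"unemployed","Born_in":"United States"'
res <- parse_pairs(x)
print(res)
```

The result is a named character vector per record, which could then be stacked and pivoted in the same way as above.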
This is an example data frame, and I just want to see if there is any function that can find duplicates using partial string matching.
df
name last
Joseph Smith
Jose Smith
Joseph Smit
Maria Cruz
maria cru
Mari Cruz
Data Prep
Using dplyr, first concatenate the first and last name into a whole name
library(dplyr)
df1 <- df %>%
rowwise() %>% # rowwise operation
mutate(whole=paste0(name,last,collapse="")) %>% # concatenate first and last name by row
ungroup() # remove rowwise grouping
Output
name last whole
1 Joseph Smith JosephSmith
2 Jose Smith JoseSmith
3 Joseph Smit JosephSmit
4 Maria Cruz MariaCruz
5 maria cru mariacru
6 Mari Cruz MariCruz
Grouping similar strings
This recursive function uses agrepl (approximate grep that returns a logical vector) to find groups of related strings and labels each group in a new column, grp. NOTE: the tolerance for differences among strings is set by max.distance. Lower values are more stringent.
desired <- NULL
grp <- 1
special <- function(x, y, grp) {
if (nrow(y) < 1) { # if y is empty return data
return(x)
} else {
similar <- agrepl(y$whole[1], y$whole, max.distance=0.4) # find similar occurring strings
x <- rbind(x, y[similar,] %>% mutate(grp=grp)) # save similar strings
y <- setdiff(y, y[similar,]) # remaining non-similar strings
special(x, y, grp+1) # run function again on non-similar strings
}
}
desired <- special(desired, df1, grp)
Output
name last whole grp
1 Joseph Smith JosephSmith 1
2 Jose Smith JoseSmith 1
3 Joseph Smit JosephSmit 1
4 Maria Cruz MariaCruz 2
5 maria cru mariacru 2
6 Mari Cruz MariCruz 2
To get rid of whole while keeping the group labels:
df2 <- desired %>% select(-whole)
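A different way to form such groups, sketched here as an assumption rather than as part of the answer above, is to compute pairwise edit distances with adist and cluster them with single-linkage hclust. The cut height of 3 edits is an arbitrary threshold chosen for this toy data.

```r
# Concatenated names from the example, lowercased so case is ignored
whole <- tolower(c("JosephSmith", "JoseSmith", "JosephSmit",
                   "MariaCruz", "mariacru", "MariCruz"))

# Pairwise edit (Levenshtein) distances between all names
d <- adist(whole)

# Single-linkage clustering: a name joins a group if it is close to
# at least one member of that group
hc <- hclust(as.dist(d), method = "single")

# Cut the tree so that names within 3 edits of a group member share a group
grp <- as.integer(cutree(hc, h = 3))
print(data.frame(whole, grp))
```

Unlike the recursive agrepl approach, the grouping here does not depend on which string happens to come first, but the threshold still has to be tuned to the data.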
I have a numeric vector:
> dput(vec_exp)
structure(c(12.344902729712, 6.54357482855349, 17.1939193108764,
23.1029632631654, 8.91495023159554, 14.3259091357051, 18.0494234749187,
2.92524638658168, 5.10306474037357, 2.66645609602021), .Names = c("Arthur_1",
"Mark_1", "Mark_2", "Mark_3", "Stephen_1", "Stephen_2",
"Stephen_3", "Rafael_1", "Marcus_1", "Georg_1"))
and then I have a data frame like the one below:
Name Nr Numb
1 Rafael 20.8337 20833.7
2 Joseph 25.1682 25168.2
3 Stephen 40.5880 40588.0
4 Leon 198.7730 198773.0
5 Thierry 16.5430 16543.0
6 Marcus 31.6600 31660.0
7 Lucas 39.6700 39670.0
8 Georg 194.9410 194941.0
9 Mark 60.1020 60102.0
10 Chris 56.0578 56057.8
I would like to multiply the numbers in the numeric vector by the numbers in the column Nr of this data frame. The values must be matched by name: Mark_1 from the numeric vector should be multiplied by Nr = 60.1020, the same for Mark_2, Stephen_3 should be multiplied by 40.5880, and so on.
Can someone recommend an easy solution?
You could use match to match the names after extracting only the first part of the names of vec_exp, i.e. extract Mark from Mark_1 etc.
vec_exp * df$Nr[match(sub("^([^_]+).*", "\\1", names(vec_exp)), df$Name)]
# Arthur_1 Mark_1 Mark_2 Mark_3 Stephen_1 Stephen_2 Stephen_3 Rafael_1 Marcus_1 Georg_1
# NA 393.28193 1033.38894 1388.53430 361.84000 581.46000 732.59000 60.94371 161.56303 519.80162
Arthur is NA because there's no match in the data.frame.
If you want to keep those entries without a match in the data as they were before, you could do it like this:
i <- match(sub("^([^_]+).*", "\\1", names(vec_exp)), df$Name)
vec_exp[!is.na(i)] <- vec_exp[!is.na(i)] * df$Nr[na.omit(i)]
This first computes the matches and then only multiplies those if they are not NA.
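The same lookup can also be sketched with a named vector instead of match. This is a small illustration with made-up numbers; key and nr are helper names introduced here, not part of the original answer.

```r
# Toy inputs, invented for this sketch
vec_exp <- c(Arthur_1 = 2, Mark_1 = 3, Mark_2 = 4)
df <- data.frame(Name = c("Mark", "Leon"),
                 Nr = c(10, 5),
                 stringsAsFactors = FALSE)

# Strip the "_<number>" suffix to get the bare name for each element
key <- sub("_.*$", "", names(vec_exp))

# Named lookup vector: indexing by name returns NA for unknown names
nr <- setNames(df$Nr, df$Name)[key]

# Element-wise product; the names of vec_exp are kept
res <- vec_exp * nr
print(res)
```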
We can use base R methods. Convert the vector to a data.frame with stack, create a 'Name' column by removing the suffix from 'ind', and merge with the data.frame ('df1'). Then we can multiply the 'Nr' and 'values' columns. Note that merge sorts the result by 'Name' by default, so the products are not in the original order of vec_exp.
d1 <- merge(df1, transform(stack(vec_exp), Name = sub("_.*", "", ind)), all.y=TRUE)
d1$Nr*d1$values
Or with dplyr, which is much easier to understand:
library(dplyr)
library(tidyr)
stack(vec_exp) %>%
separate(ind, into = c("Name", "ind")) %>%
left_join(., df1, by = "Name") %>%
mutate(res = values*Nr) %>%
.$res
#[1] NA 393.28193 1033.38894 1388.53430 361.84000
#[6] 581.46000 732.59000 60.94371 161.56303 519.80162