I want to "upgrade" my code by replacing a mass import done with a for loop with lapply. After running lapply(list.files(), read.csv) I get a list of data frames. The problem is that the data is a bit messy, and some things (like the participant's sex) are mentioned only once, in one specific cell. That wasn't a problem with a for loop, as I could just refer to that cell. When I used:
for (x in list.files()) {
  temp <- read.csv(x)
  # temp[1,4] is the one cell where the participant's sex is mentioned
  temp <- temp %>% slice(4:11) %>% select(form_2.index, form_2.response) %>% mutate(sex = temp[1,4])
  database <- rbind(database, temp)
}
each temp variable looked like this:
form_2.index form_2.response sex$form.response
<dbl> <chr> <chr>
1 1 yes male
2 2 no male
3 3 no male
4 4 yes male
5 5 yes male
6 6 yes male
7 7 no male
8 8 no male
That's what I want. But how can I refer to a certain cell when using lapply? The following code doesn't work, as the temp variable is now a list:
temp <- lapply(list.files(), read_csv)
temp %>% lapply(slice, 4:11) %>% lapply(select, form_2.index, form_2.response) %>% lapply(mutate, plec = temp[1,4])
The slice and select functions work all right, the problem lies in the mutate part. Given it's a list, I need to point to a certain element of the list, not only column and row, but how can I do that? After all, I want it to be done in each element. Any ideas?
You can do:
library(dplyr)
temp <- lapply(list.files(), function(x) {
  tmp <- readr::read_csv(x)
  tmp %>%
    slice(4:11) %>%
    select(form_2.index, form_2.response) %>%
    # tmp[[1, 4]] pulls the single cell out as a plain value rather than a 1x1 tibble
    mutate(sex = tmp[[1, 4]])
})
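If a single combined table is wanted, as the original for loop built with rbind, the list returned by lapply can be stacked afterwards. The following is only a base R sketch under stated assumptions: the two demo CSV files, the folder name lapply_demo and the column named other are made up for illustration, and rows 2:3 stand in for the 4:11 used in the question.

```r
# Hypothetical demo data: two tiny CSV files written to a temporary folder
demo_dir <- file.path(tempdir(), "lapply_demo")
dir.create(demo_dir, showWarnings = FALSE)
for (s in c("male", "female")) {
  demo <- data.frame(
    form_2.index    = 1:3,
    form_2.response = c("yes", "no", "yes"),
    other           = NA,
    form.response   = c(s, NA, NA)  # row 1, column 4 holds the sex
  )
  write.csv(demo, file.path(demo_dir, paste0(s, ".csv")), row.names = FALSE)
}

files <- list.files(demo_dir, pattern = "\\.csv$", full.names = TRUE)

# Each call sees its own data frame, so the single cell can be addressed directly
pieces <- lapply(files, function(f) {
  tmp <- read.csv(f)
  out <- tmp[2:3, c("form_2.index", "form_2.response")]
  out$sex <- tmp[1, 4]
  out
})

# Stack the per-file pieces into one data frame
database <- do.call(rbind, pieces)
```

With dplyr loaded, dplyr::bind_rows(pieces) would be an alternative to do.call(rbind, pieces).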
I wrote a code to count the appearance of words in a data frame:
Items <- c('decid*','head', 'heads')
df1<-data.frame(Items)
words<- c('head', 'heads', 'decided', 'decides', 'top', 'undecided')
df_main<-data.frame(words)
item <- vector()
count <- vector()
for (i in 1:length(unique(Items))){
item[i] <- Items[i]
count[i]<- sum(df_main$words == item[i])}
word_freq <- data.frame(cbind(item, count))
word_freq
However, the results are like this:
    item count
1 decid*     0
2   head     1
3  heads     1
As you see, it does not correctly count for "decid*". The actual results I expect should be like this:
    item count
1 decid*     2
2   head     1
3  heads     1
I think I need to change the item word (decid*) format, however, I could not figure it out. Any help is much appreciated!
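I think you want to use decid* as a wildcard pattern. == looks for an exact match; you can use grepl to look for a pattern instead. Note that, read as a plain regex, decid* means "deci" followed by zero or more "d", which would also match undecided, so the code below converts the wildcard with glob2rx(), which anchors the pattern at the start of the word.
I have used sapply as an alternative to the for loop.
result <- stack(sapply(unique(df1$Items), function(x) {
  if (grepl('*', x, fixed = TRUE)) sum(grepl(glob2rx(x), df_main$words))
  else sum(x == df_main$words)
}))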
result
# values ind
#1 2 decid*
#2 1 head
#3 1 heads
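To see why the way the wildcard is interpreted matters, here is a small base R sketch that compares the pattern used as a raw regex with the same pattern translated by glob2rx(). The variable names raw_count and glob_count are only illustrative.

```r
words <- c("head", "heads", "decided", "decides", "top", "undecided")

# As a raw regex, "decid*" is "deci" followed by zero or more "d",
# so it matches anywhere in the word, including inside "undecided"
raw_count <- sum(grepl("decid*", words))

# glob2rx() turns the shell-style wildcard into an anchored regex
glob_pattern <- glob2rx("decid*")
glob_count <- sum(grepl(glob_pattern, words))

raw_count   # 3
glob_count  # 2
```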
Using tidyverse
library(dplyr)
library(stringr)
df1 %>%
rowwise %>%
mutate(count =sum(str_detect(df_main$words,
str_c("\\b", str_replace(Items, fixed("*"), ".*" ), "\\b")))) %>%
ungroup
-output
# A tibble: 3 × 2
Items count
<chr> <int>
1 decid* 2
2 head 1
3 heads 1
Perhaps as an alternative approach altogether: instead of creating a new dataframe word_freq, why not create a new column in df_main (if that's your "main" dataframe) that indicates the matches with your (apparently key) Items? That column will not actually contain counts, because each row of the input column words holds only a single word. So the question is not how many matches there are for each row but whether there is a match in the first place. That can be indicated by grepl in base R or str_detect in stringr.
EDIT:
Given the newly posted input data
Items <- c('decid*','head', 'heads')
df1<-data.frame(Items)
words<- c('head', 'heads', 'decided', 'decides', 'top', 'undecided')
df_main<-data.frame(words)
and the OP's wish to have the matches in df_main, the solution might be this:
library(stringr)
df_main$Items_match <- +str_detect(df_main$words, str_c(Items, collapse = "|"))
Result:
df_main
words Items_match
1 head 1
2 heads 1
3 decided 1
4 decides 1
5 top 0
6 undecided 1
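For completeness, the same indicator column can be sketched in base R with grepl instead of str_detect. As in the stringr version, the items are used as plain regex alternatives, so undecided is flagged as well; the name pattern is only illustrative.

```r
Items <- c("decid*", "head", "heads")
words <- c("head", "heads", "decided", "decides", "top", "undecided")
df_main <- data.frame(words)

# Collapse the items into one alternation: "decid*|head|heads"
pattern <- paste(Items, collapse = "|")

# TRUE/FALSE per row, stored as 1/0
df_main$Items_match <- as.integer(grepl(pattern, df_main$words))
df_main
```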
I need to compare two rows next to each other in a column in a dataframe, if the data in both those rows matches, then save the most recent row, e.g.
# Animals
# 1 dog
# 2 cat
# 3 cat
It should compare dog and cat and, since they differ, not save any data. So it won't save rows 1 and 2.
But when it moves on to compare cat and cat, it should realise they are the same and save those rows, i.e. rows 2 and 3. There are several other columns, but the Animals column is the only one I need to use to decide whether a row is saved. However, I want to keep the data from all columns in the saved rows.
I need to do this for lots of rows, iterating through to compare a big set of data (~68,000)
I've tried to produce an if statement in which:
# results <- list()
# for (i in 1:(nrow(data) - 1)) {
# if(isTRUE(data$Animals[i+1] == data$Animals[i])) {
# output <- print(data$Animals[i+1])
# results[[i+1]] <- output
# output <- print(data$Animals[i])
# results[[i]] <- output
# }
#}
I then converted this results list into a dataframe for further manipulation. However, this method only gives me the animal name; I would prefer that the entire row be saved. I'm not sure how to achieve this. I've been trying to edit the statement but I can't seem to get it working.
I'm new to R and learning, please help anyway you can, I'd appreciate it :)
To "prove" that we're saving the "most recent row", I'll add a row-number column. The data:
dat <- structure(list(Animals = c("dog", "cat", "cat"), row = 1:3), row.names = c(NA, -3L), class = "data.frame")
dat
# Animals row
# 1 dog 1
# 2 cat 2
# 3 cat 3
base R
dat[c(with(dat, Animals[-nrow(dat)] != Animals[-1]), TRUE),,drop=FALSE]
# Animals row
# 1 dog 1
# 3 cat 3
dplyr
library(dplyr)
dat %>%
filter(Animals != lead(Animals, default = ''))
# Animals row
# 1 dog 1
# 2 cat 3
The only caution I have with this is that if package-loading is at all out of order, stats::filter and stats::lag also exist and behave completely differently from their dplyr namesakes. If you see odd results, try prepending dplyr:: to make sure it isn't a which-function-am-I-using problem.
dat %>%
dplyr::filter(Animals != dplyr::lead(Animals, default = ''))
We could use lead and filter
library(dplyr)
df %>%
mutate(helper = lead(animals)) %>%
filter(animals == helper) %>%
select(animals)
Output:
animals
<chr>
1 cat
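Read literally, the question asks to keep both rows of every adjacent matching pair (rows 2 and 3 in the example) together with all of their columns. A base R sketch of that reading follows; the names same_as_next, same_as_prev and kept are only illustrative, and the row column is the helper added in the answer above.

```r
dat <- data.frame(Animals = c("dog", "cat", "cat"), row = 1:3)
n <- nrow(dat)

# TRUE where a row has the same animal as the row below it
same_as_next <- c(dat$Animals[-n] == dat$Animals[-1], FALSE)

# TRUE where a row has the same animal as the row above it
same_as_prev <- c(FALSE, same_as_next[-n])

# Keep every row that belongs to at least one matching pair, with all columns
kept <- dat[same_as_next | same_as_prev, , drop = FALSE]
kept
```

Because the comparison is vectorised, this should scale to the roughly 68,000 rows mentioned in the question without an explicit loop.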
I'm new to R and I've been struggling with tidyverse.
I have created the following data.frame as example. My original data.frame has 180000 obs and 34 vars.
name <- c("chem1", "chem2", "chem3", "chem4", "chem5")
cas <- c("29331-92-5", "29331-92-6", NA, "29331-92-4", "29331-92-1" )
tib <- tibble(name, cas)
which generates this:
tib
# A tibble: 5 x 2
name cas
<chr> <chr>
1 chem1 29331-92-5
2 chem2 29331-92-6
3 chem3 NA
4 chem4 29331-92-4
5 chem5 29331-92-1
chem3 and chem1 must have same cas value, however the input file came with a NA value for chem3.
I do not know how to copy into the NA cell the cas value belonging to chem1, that is "29331-92-5".
I've been trying with tidyverse, but I'm happy to receive base R suggestions as well.
What I understood: the value of chem3 was supposed to be equal to chem1, but your input file came with an error, because in a lot of lines the values of chem3 are different from chem1.
To correct this, I would make a lookup vector that holds the correct value for each "chem". Once I have made sure that all the values in this lookup vector are correct, I would just apply it across your tib data.frame. To do this, I first extract all the current unique values of each "chem" as follows:
library(tidyverse)
group <- tib %>%
group_by(name, cas) %>%
summarise(
count = n()
)
After that, I turn these unique values of cas into a vector, and then name each value after its respective chem. Since the value of "chem3" is incorrect, I need to set it equal to the value of "chem1" before I proceed.
levels <- group$cas
names(levels) <- group$name
levels["chem3"] <- levels["chem1"]
Now that I have corrected the value of "chem3" in the lookup vector levels, I just ask R to repeat these values of levels in the same order as the names appear in your tib data.frame, and then I save the result in a new column through the mutate() function.
tib <- tib %>%
mutate(
correct_cas = levels[tib$name]
)
Resulting in this:
# A tibble: 5 x 3
name cas correct_cas
<chr> <chr> <chr>
1 chem1 29331-92-5 29331-92-5
2 chem2 29331-92-6 29331-92-6
3 chem3 NA 29331-92-5
4 chem4 29331-92-4 29331-92-4
5 chem5 29331-92-1 29331-92-1
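Since base R suggestions were welcome too, here is a minimal base R sketch of the same idea that only touches the missing cells. The mapping vector same_as is an assumption made for illustration (it records which chemical should borrow its cas from which other one), and a plain data.frame stands in for the tibble.

```r
name <- c("chem1", "chem2", "chem3", "chem4", "chem5")
cas  <- c("29331-92-5", "29331-92-6", NA, "29331-92-4", "29331-92-1")
tib  <- data.frame(name, cas, stringsAsFactors = FALSE)

# Hypothetical mapping: chem3 should carry the same cas as chem1
same_as <- c(chem3 = "chem1")

# Rows whose cas is missing and for which a source chemical is known
needs_fix <- is.na(tib$cas) & tib$name %in% names(same_as)

# Look up the first row of the source chemical and copy its cas value
source_rows <- match(same_as[tib$name[needs_fix]], tib$name)
tib$cas[needs_fix] <- tib$cas[source_rows]
tib
```

match() returns the first occurrence of each source name, so with repeated names in the real data this assumes that the first row of the source chemical holds a valid cas.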
Hi I am trying to learn ways in which I can avoid loops in my codes.
I have an example data here:
options(warn=-1) #Turning warnings off here
Company=c("A","C","B","B","A","C","C","A","B","C","B","A")
CityID=as.character(c(1,1,1,2,2,2,3,3,3,4,4,4))
Value=c(120.5,123,125,122.5,122.1,121.7,123.2,123.7,120.7,122.3,120.1,122)
Sales=c(1,1,0,0,0,1,1,0,1,0,1,0)
df=data.frame(Company,CityID,Sales,Value)
df$new_value=0
I also created a custom function (simple example only for testing purposes) as below.
funcCity12 = function(data){
data_new=data[which(data$CityID == '1'|data$CityID == '2'),]
for (i in 1:nrow(data_new)){
data_company=df[(df$Company)==data_new[i,'Company'] & !df$CityID==1 & !df$CityID==2,]
data_new[i,'new_value'] = max(data_company[data_company$Sales==1,]$Value) #Note we take the maximum value here
}
data_new
}
df2=funcCity12(data=df) # obtaining the result here
Now I am trying to write a function to avoid the for loop in the previous function.
funcCity12_no_loop = function(x,df){
data_company=df[(df$Company)==x[,'Company'] & !df$CityID==1 & !df$CityID==2,]
x[,'new_value'] = max(data_company[data_company$Sales==1,]$Value) #Note we take the maximum value here
x
}
funcCity12_no_loop(x=df[1,],df=df) #Output for the first row of df1
This seems to be working when I input the rows individually. What I am stuck at is how to run this function for all rows of the dataframe. I am not sure if the 2nd function requires more changes for this purpose. Any help is appreciated. Thanks in advance.
P.S. For the second function, my initial reaction was to create a for loop and loop through the observations, but that defeats the whole purpose.
EDIT
This is based on @eonurk's answer
zz=apply(df,1, function(x){
data_company=df[(df$Company)==x[1] & !df$CityID==1 & !df$CityID==2,]
x[5] = max(data_company[data_company$Sales==1,]$Value) #Note we take the maximum value here
x
})
Output is shown below:
You can use the apply function to reach each individual observation (row) of your dataframe.
For instance, you can multiply the Value and Sales columns, for no reason at all, with the following:
apply(df,1, function(x){ as.numeric(x["Sales"])*as.numeric(x["Value"])})
Edit:
Now you just need to use the dplyr package (for the %>% pipe):
zz=apply(df,1, function(x){
data_company=df[(df$Company)==x[1] & !df$CityID==1 & !df$CityID==2,]
x[5] = max(data_company[data_company$Sales==1,]$Value) #Note we take the maximum value here
x
}) %>% as.data.frame %>% t
Here is one way without a loop. First we filter based on your criteria, then we group by company and calculate the max, then we join the dataframe to the original dataset (also filtered based on your criteria). I didn't make it a function, but the building blocks are all there.
library(tidyverse)
list(
df %>%
filter(CityID %in% 1:2) %>%
select(-new_value),
df %>%
filter(! CityID %in% 1:2 & Sales == 1) %>%
group_by(Company) %>%
summarise(new_value = max(Value))
) %>%
reduce(full_join, by = "Company")
#> Company CityID Sales Value new_value
#> 1 A 1 1 120.5 NA
#> 2 C 1 1 123.0 123.2
#> 3 B 1 0 125.0 120.7
#> 4 B 2 0 122.5 120.7
#> 5 A 2 0 122.1 NA
#> 6 C 2 1 121.7 123.2
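The same building blocks can also be wrapped into a loop-free function in base R. This is only a sketch: the function name funcCity12_vec is made up, and companies with no qualifying sale outside cities 1 and 2 get NA (as in the join above) rather than the -Inf that max() produces in the original loop.

```r
Company <- c("A","C","B","B","A","C","C","A","B","C","B","A")
CityID  <- as.character(c(1,1,1,2,2,2,3,3,3,4,4,4))
Value   <- c(120.5,123,125,122.5,122.1,121.7,123.2,123.7,120.7,122.3,120.1,122)
Sales   <- c(1,1,0,0,0,1,1,0,1,0,1,0)
df <- data.frame(Company, CityID, Sales, Value, stringsAsFactors = FALSE)

# Hypothetical loop-free version of funcCity12
funcCity12_vec <- function(data) {
  in12 <- data$CityID %in% c("1", "2")
  # Rows outside cities 1 and 2 that recorded a sale
  pool <- data[!in12 & data$Sales == 1, ]
  # One maximum per company, as a named numeric vector
  best <- vapply(split(pool$Value, as.character(pool$Company)), max, numeric(1))
  out <- data[in12, ]
  # Look the maximum up by company name; unknown companies become NA
  out$new_value <- unname(best[as.character(out$Company)])
  out
}

df2 <- funcCity12_vec(df)
df2
```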
I need to transform some vanilla xml into a data frame. The XML is a simple representation of rectangular data (see example below). I can achieve this pretty straightforwardly in R with xml2 and a couple of for loops. However, I'm sure there is a much better/faster way (purrr?). The XML I will be ultimately working with are very large, so more efficient methods are preferred. I would be grateful for any advice from the community.
library(tidyverse)
library(xml2)
demo_xml <-
"<DEMO>
<EPISODE>
<item1>A</item1>
<item2>1</item2>
</EPISODE>
<EPISODE>
<item1>B</item1>
<item2>2</item2>
</EPISODE>
</DEMO>"
dx <- read_xml(demo_xml)
episodes <- xml_find_all(dx, xpath = "//EPISODE")
dx_names <- xml_name(xml_children(episodes[1]))
df <- data.frame()
for(i in seq_along(episodes)) {
for(j in seq_along(dx_names)) {
df[i, j] <- xml_text(xml_find_all(episodes[i], xpath = dx_names[j]))
}
}
names(df) <- dx_names
df
#> item1 item2
#> 1 A 1
#> 2 B 2
Created on 2019-09-19 by the reprex package (v0.3.0)
Thank you in advance.
This is a general solution which handles a varying number of different sub-nodes for each parent node. Each Episode node may have different sub-nodes.
This strategy parses the children nodes identifying the name and values of each sub node. Then it converts this list into a longer style dataframe and then reshapes it into your desired wider style:
library(tidyr)
library(xml2)
demo_xml <-
"<DEMO>
<EPISODE>
<item1>A</item1>
<item2>1</item2>
</EPISODE>
<EPISODE>
<item1>B</item1>
<item2>2</item2>
</EPISODE>
</DEMO>"
dx <- read_xml(demo_xml)
#find all episodes
episodes <- xml_find_all(dx, xpath = "//EPISODE")
#extract the node names and values from all of the episodes
nodenames<-xml_name(xml_children(episodes))
contents<-trimws(xml_text(xml_children(episodes)))
#Identify the number of subnodes under each episode for labeling
IDlist<-rep(seq_along(episodes), xml_length(episodes))
#make a long dataframe
df<-data.frame(episodes=IDlist, nodenames, contents, stringsAsFactors = FALSE)
#make the dataframe wide, Remove unused blank nodes:
answer <- spread(df[df$contents!="",], nodenames, contents)
#tidyr 1.0.0 version
#answer <- pivot_wider(df, names_from = nodenames, values_from = contents)
# A tibble: 2 x 3
episodes item1 item2
<int> <chr> <chr>
1 1 A 1
2 2 B 2
This may be an option without using a for loop,
episodes <- xml_find_all(dx, xpath = "//EPISODE")
dx_names <- xml_name(xml_children(episodes[1]))
# You can get all values between the tags by xml_text()
values <- xml_children(episodes) %>% xml_text()
as.data.frame(matrix(values,
                     ncol = length(dx_names),
                     dimnames = list(seq_along(episodes), dx_names), byrow = TRUE))
gives,
item1 item2
1 A 1
2 B 2
Note that you may need to convert the item2 column to numeric with as.numeric(), since this solution stores it as character (or as a factor in R versions before 4.0, where as.numeric(as.character()) is needed).
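Another loop-free variant, sketched here with xml2 and base R only, builds one single-row data frame per episode and stacks them. It assumes that every EPISODE has the same child elements in the same order; the names rows and df_xml are only illustrative.

```r
library(xml2)

demo_xml <- "<DEMO>
  <EPISODE><item1>A</item1><item2>1</item2></EPISODE>
  <EPISODE><item1>B</item1><item2>2</item2></EPISODE>
</DEMO>"

dx <- read_xml(demo_xml)
episodes <- xml_find_all(dx, xpath = "//EPISODE")

# One named list of child values per episode
rows <- lapply(episodes, function(ep) {
  kids <- xml_children(ep)
  setNames(as.list(xml_text(kids)), xml_name(kids))
})

# Turn each list into a one-row data frame and stack the rows
df_xml <- do.call(rbind, lapply(rows, as.data.frame, stringsAsFactors = FALSE))

# Values arrive as text, so convert numeric columns explicitly
df_xml$item2 <- as.numeric(df_xml$item2)
df_xml
```

For very large files, extracting all names and values in one pass, as in the long-then-wide answer above, is likely to be faster than building many small data frames.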