concatenating only vector values from a row - r

I have a problem with my R code. To start, I have a dataframe (df) with one column that contains both single numerical values and vectors of numerical values. This is an example of some rows of the dataframe:
1. 60011000
2. 60523000
4. 60490000
5. 60599000
6. c("60741000", "60740000", "60742000")
7. 60647000
8. c("60766000", "60767000")
9. c("60563000", "60652000")
As you can see in the list, some rows (6, 8 & 9) contain vectors. I want to concatenate the elements of each vector into a single element.
For example the result from the vector of line 6 should look like this:
607410006074000060742000
And the result of line 8 should look like this:
6076600060767000
My dataframe has more than 30,000 rows, so it is impossible for me to do this manually.
Can you help me solve my problem? It is important that the number of rows does not change.
Thank you very much, and please excuse any mistakes I made. I am not a native speaker.

The data:
dat <- read.table(text='60011000
60523000
60490000
60599000
c("60741000", "60740000", "60742000")
60647000
c("60766000", "60767000")
c("60563000", "60652000")', sep = "\t")
dat
# V1
# 1 60011000
# 2 60523000
# 3 60490000
# 4 60599000
# 5 c(60741000, 60740000, 60742000)
# 6 60647000
# 7 c(60766000, 60767000)
# 8 c(60563000, 60652000)
You can use gsub to replace all non-digit characters with the empty string.
dat$V1 <- gsub("[^0-9]+", "", dat$V1)
dat
# V1
# 1 60011000
# 2 60523000
# 3 60490000
# 4 60599000
# 5 607410006074000060742000
# 6 60647000
# 7 6076600060767000
# 8 6056300060652000
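As an aside (not part of the answer above): if the column is actually stored as a list-column, with each cell holding a character vector rather than a deparsed string like 'c("60741000", ...)', a sapply/paste sketch collapses each cell while keeping the row count unchanged. The dat2 object below is a hypothetical reconstruction of that situation:
# Hypothetical list-column version of the data (an assumption, not the OP's file)
dat2 <- data.frame(row = 1:3)
dat2$V1 <- list("60011000",
                c("60741000", "60740000", "60742000"),
                c("60766000", "60767000"))
# Collapse each cell into a single string; the number of rows does not change
dat2$V1 <- sapply(dat2$V1, paste, collapse = "")
dat2$V1
# [1] "60011000"                  "607410006074000060742000" "6076600060767000"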

You could do:
df = data.frame(a = c(1, 2, 3, 4, 'c("60741000", "60740000", "60742000")'),
                b = c(1, 2, 3, 4, 5),
                stringsAsFactors = FALSE)
> df
a b
1 1 1
2 2 2
3 3 3
4 4 4
5 c("60741000", "60740000", "60742000") 5
df[,"a"]=sapply(df[,"a"],function(x) paste(eval(parse(text=x)),collapse = ""))
> df
a b
1 1 1
2 2 2
3 3 3
4 4 4
5 607410006074000060742000 5

Here you go (looks like someone beat me to the punch).
df <- read.table("df.txt", header = FALSE)
df
# V1
# 1 123
# 2 12
# 3 c("1","55","6")
# 4 356
# 5 c("99","55","3")
df[,1] <- as.numeric(as.character(gsub("[^0-9]","",df[,1])))
df
# V1
# 1 123
# 2 12
# 3 1556
# 4 356
# 5 99553
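A hedged caveat the answers above don't mention: if the concatenated IDs from the question (16-24 digits) are coerced with as.numeric, they exceed the ~15-16 significant digits a double can hold, so it is safer to keep the result as character:
# Two 24-digit strings differing in the last digit collapse to the same double
as.numeric("607410006074000060742000") == as.numeric("607410006074000060742001")
# [1] TRUE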

Related

R - How to create multiple datasets based on levels of factor in multiple columns?

I'm kinda new to R and still looking for ways to make my code more elegant. I want to create multiple datasets in a more efficient way, each based on a particular value over different columns.
This is my dataset:
df <- data.frame(A = c(1,2,2,3,4,5,1,1,2,3),
                 B = c(4,4,2,3,4,2,1,5,2,2),
                 C = c(3,3,3,3,4,2,5,1,2,3),
                 D = c(1,2,5,5,5,4,5,5,2,3),
                 E = c(1,4,2,3,4,2,5,1,2,3),
                 dummy1 = c("yes","yes","no","no","no","no","yes","no","yes","yes"),
                 dummy2 = c("high","low","low","low","high","high","high","low","low","high"))
And I need each column to be a factor:
df[colnames(df)] <- lapply(df[colnames(df)], factor)
Now, what I want to obtain is one dataframe called "Likert_rank_yes" containing all observations where column "dummy1" is "yes", one dataframe called "Likert_rank_no" containing all observations where "dummy1" is "no", one dataframe called "Likert_rank_high" containing all observations where "dummy2" is "high", and so on for all my other dummies.
I want to loop or streamline the process in some way, so that there are few commands to run to get all the datasets I need.
The first two dataframes should look something like this:
Dataframe called "Likert_rank_yes" that contains all the observations that in the column "dummy1" have "yes"
Dataframe called "Likert_rank_no" that contains all the observations that in the column "dummy1" have "no"
I have to do this with several dummies with multiple levels and would like to automate/loop the process or make it more efficient, so that I don't have to subset and rename every dataframe for each dummy level. Ideally I would also need to drop the last column in each df created (the one containing the dummy considered).
I tried splitting like below, but it seems that is not possible with multiple grouping columns; I just get 4 dfs based on the combinations (yes AND high observations, yes AND low obs, no AND high obs, etc.), like so:
Splitting with a list of columns doesn't work
list_df <- split(df[c(1:5)], list(df$dummy1,df$dummy2), sep=".")
Can you help? Thanks in advance!
You need two lapplys:
vals <- colnames(df)[1:5]
dummies <- colnames(df)[-(1:5)]
step1 <- lapply(dummies, function(x) df[, c(vals, x)])
step2 <- lapply(step1, function(x) split(x, x[, 6]))
names(step2) <- dummies
step2
# $dummy1
# $dummy1$no
# A B C D E dummy1
# 3 2 2 3 5 2 no
# 4 3 3 3 5 3 no
# 5 4 4 4 5 4 no
# 6 5 2 2 4 2 no
# 8 1 5 1 5 1 no
#
# $dummy1$yes
# A B C D E dummy1
# 1 1 4 3 1 1 yes
# 2 2 4 3 2 4 yes
# 7 1 1 5 5 5 yes
# 9 2 2 2 2 2 yes
# 10 3 2 3 3 3 yes
#
#
# $dummy2
# $dummy2$high
# A B C D E dummy2
# 1 1 4 3 1 1 high
# 5 4 4 4 5 4 high
# 6 5 2 2 4 2 high
# 7 1 1 5 5 5 high
# 10 3 2 3 3 3 high
#
# $dummy2$low
# A B C D E dummy2
# 2 2 4 3 2 4 low
# 3 2 2 3 5 2 low
# 4 3 3 3 5 3 low
# 8 1 5 1 5 1 low
# 9 2 2 2 2 2 low
For the first data set ("dummy1" and "no") use step2$dummy1$no or step2[[1]][[1]] or step2[["dummy1"]][["no"]].
For programming purposes it is usually better to keep the list intact since it makes it simple to write code that processes all of the data frames in the list without having to specify them individually.
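For instance, a minimal sketch (not part of the original answer) that processes every data frame in step2 at once, dropping the dummy column the question wanted removed:
# Drop column 6 (the dummy) from each data frame in the nested list
likert_only <- lapply(step2, function(lst) lapply(lst, function(d) d[, -6]))
str(likert_only$dummy1$yes)  # 5 obs. of the 5 factor variables A, B, C, D, E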
You are very close:
tbls <- unlist(step2, recursive=FALSE)
list2env(tbls, envir=.GlobalEnv)
ls()
# [1] "df" "dummies" "dummy1.no" "dummy1.yes" "dummy2.high" "dummy2.low" "step1" "step2" "tbls" "vals"
This creates each of the split data frames (dummy1.no, dummy1.yes, dummy2.high, dummy2.low) as a separate object in the global environment.

How to check if rows in one column are present in another column in R

I have a data set = data1 with id and emails as follows:
id emails
1 A,B,C,D,E
2 F,G,H,A,C,D
3 I,K,L,T
4 S,V,F,R,D,S,W,A
5 P,A,L,S
6 Q,W,E,R,F
7 S,D,F,E,Q
8 Z,A,D,E,F,R
9 X,C,F,G,H
10 A,V,D,S,C,E
I have another data set = data2 with check_email as follows:
check_email
A
D
S
V
I want to check whether the check_email values are present in data1, and to keep only those id from data1 whose emails contain at least one of the check_email values in data2.
My desired output will be:
id
1
2
4
5
7
8
10
I wrote code using a for loop, but it is taking forever because my actual dataset is very large.
Any advice in this regard will be highly appreciated!
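For reference, here is a sketch that rebuilds the example data shown above (the post only displays the printed tables), so the answers below can be run directly:
data1 <- data.frame(
  id = 1:10,
  emails = c("A,B,C,D,E", "F,G,H,A,C,D", "I,K,L,T", "S,V,F,R,D,S,W,A",
             "P,A,L,S", "Q,W,E,R,F", "S,D,F,E,Q", "Z,A,D,E,F,R",
             "X,C,F,G,H", "A,V,D,S,C,E"),
  stringsAsFactors = FALSE)
data2 <- data.frame(check_email = c("A", "D", "S", "V"),
                    stringsAsFactors = FALSE)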
You can use a regular expression to subset your data. First, collapse everything into one pattern:
paste(data2$check_email, collapse = "|")
# [1] "A|D|S|V"
Then use grep to get the indices of the rows whose emails match the pattern:
grep(paste(data2$check_email, collapse = "|"), data1$emails)
# [1] 1 2 4 5 7 8 10
And then combine everything:
data1[grep(paste(data2$check_email, collapse = "|"), data1$emails), ]
# id emails
# 1 1 A,B,C,D,E
# 2 2 F,G,H,A,C,D
# 3 4 S,V,F,R,D,S,W,A
# 4 5 P,A,L,S
# 5 7 S,D,F,E,Q
# 6 8 Z,A,D,E,F,R
# 7 10 A,V,D,S,C,E
data1[rowSums(sapply(data2$check_email, function(x) grepl(x,data1$emails))) > 0, "id", F]
id
1 1
2 2
4 4
5 5
7 7
8 8
10 10
We can split each element of the character vector as.character(data1$emails) into substrings, then iterate over this list with sapply, checking whether any of the substrings are contained in data2$check_email. Finally we extract those rows from data1:
> emails <- strsplit(as.character(data1$emails), ",")
> ind <- sapply(emails, function(emails) any(emails %in% as.character(data2$check_email)))
> data1[ind,"id", drop = FALSE]
id
1 1
2 2
4 4
5 5
7 7
8 8
10 10

Repeat vector to fill down column in data frame

Seems like this very simple maneuver used to work for me, and now it simply doesn't. A dummy version of the problem:
df <- data.frame(x = 1:5) # create simple dataframe
df
x
1 1
2 2
3 3
4 4
5 5
df$y <- c(1:5) # adding a new column with a vector of the exact same length. Works out like it should
df
x y
1 1 1
2 2 2
3 3 3
4 4 4
5 5 5
df$z <- c(1:4) # trying to add a new column, this time with a vector with fewer elements than there are rows in the dataframe.
Error in `$<-.data.frame`(`*tmp*`, "z", value = 1:4) :
replacement has 4 rows, data has 5
I was expecting this to work with the following result:
x y z
1 1 1 1
2 2 2 2
3 3 3 3
4 4 4 4
5 5 5 1
I.e. the shorter vector should just start repeating itself automatically. I'm pretty certain this used to work for me (it's in a script that I've been running a hundred times before without problems). Now I can't even get the above dummy example to work like I want to. What am I missing?
If the vector can be evenly recycled into the data.frame, you do not get an error or a warning:
df <- data.frame(x = 1:10)
df$z <- 1:5
This may be what you were experiencing before.
You can get your vector to fit as you mention with rep_len:
df$y <- rep_len(1:3, length.out=10)
This results in
df
x z y
1 1 1 1
2 2 2 2
3 3 3 3
4 4 4 1
5 5 5 2
6 6 1 3
7 7 2 1
8 8 3 2
9 9 4 3
10 10 5 1
Note that in place of rep_len, you could use the more common rep function:
df$y <- rep(1:3,len=10)
From the help file for rep:
rep.int and rep_len are faster simplified versions for two common cases. They are not generic.
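Applied to the 5-row example from the question, a small sketch (combining the question's df with rep_len) reproduces the expected output:
df <- data.frame(x = 1:5, y = 1:5)
df$z <- rep_len(1:4, nrow(df))  # recycles 1:4 to length 5: 1 2 3 4 1
df
#   x y z
# 1 1 1 1
# 2 2 2 2
# 3 3 3 3
# 4 4 4 4
# 5 5 5 1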
If the total number of rows is a multiple of the length of your new vector, it works fine. When it is not, recycling does not work everywhere. In particular, you have probably seen this type of recycling with matrices:
data.frame(1:6, 1:3, 1:4) # not a multiple
# Error in data.frame(1:6, 1:3, 1:4) :
# arguments imply differing number of rows: 6, 3, 4
data.frame(1:6, 1:3) # a multiple
# X1.6 X1.3
# 1 1 1
# 2 2 2
# 3 3 3
# 4 4 1
# 5 5 2
# 6 6 3
cbind(1:6, 1:3, 1:4) # works even with not a multiple
# [,1] [,2] [,3]
# [1,] 1 1 1
# [2,] 2 2 2
# [3,] 3 3 3
# [4,] 4 1 4
# [5,] 5 2 1
# [6,] 6 3 2
# Warning message:
# In cbind(1:6, 1:3, 1:4) :
# number of rows of result is not a multiple of vector length (arg 3)

How can I reshape my dataframe?

I have a huge data frame; a simplified version looks like this:
trials=c("1","2","3","4","5","6","7","8","9","10")
co =c(rep ("1",10))
stim=c("8","9","11","2","4","7","8","1","12","16")
ansbin=c("1","0","1","0","0","1","0","1","1","0")
stim.1=c("11","2","11","7","4","3","9","1","4","16")
ansbin.1=c("0","0","1","0","0","1","0","1","1","1")
trials.1=c("1","2","3","4","5","6","7","8","9","10")
co.1 =c(rep ("2",10))
stim1.1=c("11","2","11","2","5","7","8","15","17","10")
ansbin1.1=c("1","1","1","0","0","1","1","1","0","1")
stim2.1=c("11","2","14","1","4","8","9","10","4","12")
ansbin2.1=c("0","1","1","0","0","1","0","0","1","0")
ID<- data.frame(trials,co,stim,ansbin,stim.1,ansbin.1,trials.1,co.1,stim1.1,ansbin1.1,stim2.1,ansbin2.1)
View(ID)
Now I would like to reshape my data.frame so that "stim", "stim.1", "stim1.1" and "stim2.1" end up under a single column called "stimulus", and the same for the answers: "ansbin", "ansbin.1", "ansbin1.1" and "ansbin2.1" should go under a single column called "answers".
At the same time, trials and trials.1 should be under the same column, with the "co" column marking the difference.
I tried to use "reshape" like this:
df <- reshape(ID, direction = "long",
              idvar = c("trials", "co"),
              varying = c("stim", "stim.1", "stim1.1", "stim2.1",
                          "ansbin", "ansbin.1", "ansbin1.1", "ansbin2.1"),
              v.names = c("stimulus", "answer"),
              timevar = "num")
but I get problems and warnings every time. I think it is a problem linked to the column names.
Can you help me?
Thank you in advance! :)
Here's the approach I would take:
library(data.table)
melt(
  rbindlist(split.default(ID, cumsum(grepl("^trials", names(ID))))),
  measure.vars = patterns("^stim", "^ansbin"), value.name = c("stim", "ansbin"))
# trials co variable stim ansbin
# 1: 1 1 1 8 1
# 2: 2 1 1 9 0
# 3: 3 1 1 11 1
# 4: 4 1 1 2 0
# 5: 5 1 1 4 0
# ---
# 36: 6 2 2 8 1
# 37: 7 2 2 9 0
# 38: 8 2 2 10 0
# 39: 9 2 2 4 1
# 40: 10 2 2 12 0
Basically, it sounds like you're looking at two rounds of "reshaping":
1. Stacking the columns from "trials" to the second set of "ansbin" on top of each other. I've done that with the rbindlist(split.default(...)) part of my answer.
2. Stacking each resulting pair of "stim" and "ansbin" columns on top of each other. I've done that with the melt(...) part of my answer.
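If it helps to see the first round on its own, a small sketch (using the same ID data frame) shows that the split.default call just divides the 12 columns into two 6-column blocks of 10 rows each:
blocks <- split.default(ID, cumsum(grepl("^trials", names(ID))))
sapply(blocks, dim)
#       1  2
# [1,] 10 10
# [2,]  6  6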
Consider building a list of reshaped dataframes for each set (co, trials, stimulus, and answers), then merging them together. However, because co and trials carry only two columns each while the latter two carry four, consider repeating those columns prior to reshaping:
ID$co2 <- ID$co
ID$co3 <- ID$co.1
ID$trials.2 <- ID$trials
ID$trials.3 <- ID$trials.1
df_list <- lapply(c("co", "trials", "stim", "ans"), function(s)
  reshape(ID, direction = "long",
          varying = grep(s, names(ID)),
          v.names = c(s),
          drop = grep(paste0("^", s), names(ID), invert = TRUE),
          timevar = "num",
          new.row.names = 1:1000)
)
# CHAIN MERGE
finaldf <- Reduce(function(x, y) merge(x, y, by=c('id', 'num')), df_list)
finaldf <- with(finaldf, finaldf[order(num, id),]) # SORT DATAFRAME
rownames(finaldf) <- NULL # RESET ROWNAMES
head(finaldf)
# id num co trials stim ans
# 1 1 1 1 1 8 1
# 2 2 1 1 2 9 0
# 3 3 1 1 3 11 1
# 4 4 1 1 4 2 0
# 5 5 1 1 5 4 0
# 6 6 1 1 6 7 1

Performing calculations on binned counts in R

I have a dataset stored in a text file in the format of bins of values followed by counts, like this:
var_a 1:5 5:12 7:9 9:14 ...
indicating that var_a took on the value 1 five times in the dataset, the value 5 twelve times, etc. Each variable is on its own line in that format.
I'd like to be able to perform calculations on this dataset in R, like quantiles, variance, and so on. Is there an easy way to load the data from the file and calculate these statistics? Ultimately I'd like to make a box-and-whisker plot for each variable.
Cheers!
You could use readLines to read in the data file
.x <- readLines(datafile)
I will create some dummy data, as I don't have the file. This should be the equivalent of the output of readLines
## dummy
.x <- c("var_a 1:5 5:12 7:9 9:14", 'var_b 1:5 2:12 3:9 4:14')
I split by spaces to get each variable's name and its contents:
#split by space
space_split <- strsplit(.x, ' ')
# get the variable names (first in each list)
variable_names <- lapply(space_split,'[[',1)
# get the variable contents (everything but the first element in each list)
variable_contents <- lapply(space_split,'[',-1)
# a function to do the appropriate replicates
do_rep <- function(x){rep.int(x[1],x[2])}
# recreate the variables
variables <- lapply(variable_contents, function(x){
.list <- strsplit(x, ':')
unlist(lapply(lapply(.list, as.numeric), do_rep))
})
names(variables) <- variable_names
You could get the variance for each variable using:
lapply(variables, var)
## $var_a
## [1] 6.848718
##
## $var_b
## [1] 1.138462
or get boxplots:
boxplot(variables)
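Since the question also asks about quantiles, the same expanded list can be fed straight to quantile or summary (a small addition, not in the original answer):
lapply(variables, quantile)
lapply(variables, summary)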
Not knowing the actual form that your data is in, I would probably use something like readLines to get each line in as a vector, then do something like the following:
# Some sample data
temp = c("var_a 1:5 5:12 7:9 9:14",
         "var_b 1:7 4:9 3:11 2:10",
         "var_c 2:5 5:14 6:6 3:14")
# Extract the names
NAMES = gsub("[0-9: ]", "", temp)
# Extract the data
temp_1 = strsplit(temp, " |:")
temp_1 = lapply(temp_1, function(x) as.numeric(x[-1]))
# "Expand" the data
temp_1 = lapply(1:length(temp_1),
                function(x) rep(temp_1[[x]][seq(1, length(temp_1[[x]]), by=2)],
                                temp_1[[x]][seq(2, length(temp_1[[x]]), by=2)]))
names(temp_1) = NAMES
temp_1
# $var_a
# [1] 1 1 1 1 1 5 5 5 5 5 5 5 5 5 5 5 5 7 7 7 7 7 7 7 7 7 9 9 9 9 9 9 9 9 9 9 9 9 9 9
#
# $var_b
# [1] 1 1 1 1 1 1 1 4 4 4 4 4 4 4 4 4 3 3 3 3 3 3 3 3 3 3 3 2 2 2 2 2 2 2 2 2 2
#
# $var_c
# [1] 2 2 2 2 2 5 5 5 5 5 5 5 5 5 5 5 5 5 5 6 6 6 6 6 6 3 3 3 3 3 3 3 3 3 3 3 3 3 3
